
Image and Vision Computing: Latest Publications

BSA-Dehaze: Multi-Scale Bitemporal Fusion and Size-Aware Decoder for Unsupervised Image Dehazing
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105819
Wujin Li, Qian Xing, Wei He, Longyuan Guo, Jianhui Wu, Minzhi Zhao, Siyuan Chen
Single-image dehazing plays a critical role in various autonomous vision systems. Early methods relied on hand-crafted optimization techniques, whereas recent approaches leverage deep neural networks trained on synthetic data, owing to the scarcity of real-world paired datasets. However, this often results in domain bias when applied to outdoor scenes. In this paper, we present BSA-Dehaze, an unsupervised single-image dehazing framework that integrates a Multi-Scale Bitemporal Fusion Module (MBFM) and a Size-Aware Decoder (SA-Decoder). The method operates without requiring ground-truth images. Our method reformulates dehazing as a haze-to-clear image translation task. BSA-Dehaze incorporates a novel Encoder-SA-Decoder built with ResNet blocks, designed to better preserve image details and edge sharpness. To enhance feature fusion and training efficiency, we introduce the MBFM. A multi-scale discriminator (MSD) is proposed, along with Hinge Loss and Dynamic Block-wise Contrastive Loss, to improve training stability and emphasize challenging samples. Ablation studies verify the contribution of each component. Experimental results on SOTS outdoor, BeDDE, and a real-world dataset demonstrate that our method surpasses existing approaches in both performance and efficiency, despite being trained on significantly less data.
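A minimal sketch of the hinge adversarial loss named above, assuming generic discriminator scores; the tensor names and shapes are illustrative, and the multi-scale discriminator and the Dynamic Block-wise Contrastive Loss are not reproduced here.

```python
# Hedged sketch: standard hinge GAN losses of the kind the abstract mentions.
# d_real / d_fake are illustrative stand-ins for discriminator outputs.
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator hinge loss: raise the discriminator score of generated (dehazed) images."""
    return -d_fake.mean()

# Example usage with random scores standing in for multi-scale discriminator outputs.
d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
print(d_hinge_loss(d_real, d_fake).item(), g_hinge_loss(d_fake).item())
```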
Citations: 0
A comprehensive survey on magnetic resonance image reconstruction
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105832
Xiaoyan Kui, Zijie Fan, Zexin Ji, Qinsong Li, Chengtao Liu, Weixin Si, Beiji Zou
Magnetic resonance imaging (MRI) reconstruction is a fundamental task aimed at recovering high-quality images from undersampled or low-quality MRI data. This process enhances diagnostic accuracy and optimizes clinical applications. In recent years, deep learning-based MRI reconstruction has made significant progress. Advancements include single-modality feature extraction using different network architectures, the integration of multimodal information, and the adoption of unsupervised or semi-supervised learning strategies. However, despite extensive research, MRI reconstruction remains a challenging problem that has yet to be fully resolved. This survey provides a systematic review of MRI reconstruction methods, covering key aspects such as data acquisition and preprocessing, publicly available datasets, single and multi-modal reconstruction models, training strategies, and evaluation metrics based on image reconstruction and downstream tasks. Additionally, we analyze the major challenges in this field and explore potential future directions.
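To make the reconstruction problem the survey addresses concrete, here is a minimal zero-filled baseline for undersampled Cartesian MRI: mask k-space, inverse-transform, and measure the resulting aliasing error. The sampling rate, mask pattern, and the random stand-in image are assumptions for illustration only.

```python
# Hedged sketch: the basic undersampled-MRI setting that learned reconstruction improves on.
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128))          # stand-in for a ground-truth slice
kspace = np.fft.fftshift(np.fft.fft2(image))     # fully sampled k-space

mask = rng.random(128) < 0.3                     # keep ~30% of phase-encode lines
mask[64 - 8:64 + 8] = True                       # always keep the low-frequency center
undersampled = kspace * mask[None, :]

zero_filled = np.fft.ifft2(np.fft.ifftshift(undersampled)).real
print("aliasing error:", np.abs(zero_filled - image).mean())
```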
Citations: 0
HF-D-FINE: High-resolution features enhanced D-FINE for tiny object detection in UAV image
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105834
Qianhua Hu, Liantao Wang
Real-time detection in UAV-captured imagery remains a formidable challenge, primarily owing to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone’s feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attentive mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces an aspect-ratio consistency penalty factor and auxiliary boxes that build on the advantages of Shape-IoU and NWD, making it more suitable for tiny object detection tasks. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with a tiny increase in FLOPs. On the VisDrone dataset, AP increases by 3.2% over D-FINE-S and AP50 by 4.3%, confirming the method's efficacy and superiority for tiny object detection in UAV images.
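Since the abstract references NWD without details, the sketch below implements the commonly used Normalized Wasserstein Distance for tiny boxes that Outer-SNWD presumably builds on; the normalizing constant C and the (cx, cy, w, h) box layout are assumptions, not the paper's implementation.

```python
# Hedged sketch: Normalized Wasserstein Distance between two boxes modeled as 2-D Gaussians.
import torch

def nwd(box1: torch.Tensor, box2: torch.Tensor, C: float = 12.8) -> torch.Tensor:
    """Boxes are (..., 4) tensors in (cx, cy, w, h) format."""
    # Each box is modeled as a Gaussian N([cx, cy], diag((w/2)^2, (h/2)^2)).
    g1 = torch.stack([box1[..., 0], box1[..., 1], box1[..., 2] / 2, box1[..., 3] / 2], dim=-1)
    g2 = torch.stack([box2[..., 0], box2[..., 1], box2[..., 2] / 2, box2[..., 3] / 2], dim=-1)
    w2_dist = torch.linalg.norm(g1 - g2, dim=-1)   # 2-Wasserstein distance between the Gaussians
    return torch.exp(-w2_dist / C)                 # maps distance to a (0, 1] similarity

a = torch.tensor([10.0, 10.0, 4.0, 4.0])
b = torch.tensor([11.0, 10.5, 4.0, 5.0])
print(nwd(a, b))   # nearby small boxes give a similarity close to 1
```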
Citations: 0
Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi, Mrim M. Alnfiai, Mona Mohammed Alnahari, Salma Mohsen M. Alnefaie, Faiz Abdullah Alotaibi

Objective

To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.

Methods

The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop the proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects. (A minimal sketch of the recursive Bayesian update appears after the Conclusion below.)

Results

The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.

Conclusion

The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.
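The Methods section above relies on recursive Bayesian filtering to fuse sensor readings into an environmental-context estimate. The sketch below is a minimal discrete Bayes update over invented states, likelihoods, and ambient-light readings; none of these values come from the study.

```python
# Hedged sketch: a discrete recursive Bayesian update fusing sensor evidence into a
# context belief. States, likelihood table, and readings are invented for illustration.
import numpy as np

states = ["indoor", "outdoor", "transition"]
belief = np.array([1 / 3, 1 / 3, 1 / 3])              # uniform prior over contexts

# P(observation | state) for a discretized ambient-light reading: low / medium / high.
light_likelihood = np.array([
    [0.6, 0.3, 0.1],   # indoor
    [0.1, 0.3, 0.6],   # outdoor
    [0.3, 0.4, 0.3],   # transition
])

def bayes_update(belief: np.ndarray, likelihood_col: np.ndarray) -> np.ndarray:
    posterior = belief * likelihood_col
    return posterior / posterior.sum()

for reading in [2, 2, 1]:                              # two "high" then one "medium" reading
    belief = bayes_update(belief, light_likelihood[:, reading])
print(dict(zip(states, belief.round(3))))              # belief shifts toward "outdoor"
```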
Citations: 0
A human layout consistency framework for image-based virtual try-on
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105831
Rong Huang, Zhicheng Wang, Hao Liu, Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, which require no ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training the try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
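As a minimal illustration of the human-layout-consistency supervision described above, the sketch below scores a pixel-wise cross-entropy between layout logits parsed from the try-on result (standing in for HLP) and an expected layout map (standing in for HLG); shapes and the class count are assumptions.

```python
# Hedged sketch: layout-consistency supervision reduced to pixel-wise cross-entropy.
import torch
import torch.nn.functional as F

num_classes, h, w = 7, 64, 48                      # e.g. background, skin, hair, garment, ...
parsed_logits = torch.randn(2, num_classes, h, w)  # HLP-style logits on the generated try-on image
expected_layout = torch.randint(0, num_classes, (2, h, w))  # HLG-style expected layout map

layout_consistency_loss = F.cross_entropy(parsed_logits, expected_layout)
print(layout_consistency_loss.item())
```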
Citations: 0
Multi-modal cooperative fusion network for dual-stream RGB-D salient object detection
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105835
Jingyu Wu, Fuming Sun, Haojie Li, Mingyu Lu
Most existing RGB-D salient object detection methods rely on convolution operations to build complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important to salient object detection (SOD). Discrepancies between the two modalities, however, seriously hinder such models from achieving better performance. To address the issues mentioned above, we design a multi-modal cooperative fusion network (MCFNet) to achieve RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to optimize erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve an efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, providing new ideas for salient object detection tasks.
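The sketch below shows one plausible channel-attentive RGB-depth fusion step in the spirit of the progressive fusion module described above; the exact module layout is an assumption, not the paper's design.

```python
# Hedged sketch: a simple channel-gated fusion of RGB and depth features.
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        both = torch.cat([rgb_feat, depth_feat], dim=1)
        attn = self.gate(both)                       # per-channel weights from both modalities
        fused = self.merge(both)
        return fused * attn + rgb_feat               # residual keeps the RGB stream dominant

x_rgb, x_d = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(ChannelFusion(64)(x_rgb, x_d).shape)           # torch.Size([1, 64, 32, 32])
```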
Citations: 0
CFE-PVTSeg: Cross-domain frequency-enhanced pyramid vision transformer segmentation network
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105824
Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang
Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.
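To illustrate the frequency-domain branch described above, the sketch below extracts the high-frequency sub-bands of a Haar DWT using PyWavelets; the library choice, wavelet, and input are assumptions, since the abstract does not specify the implementation.

```python
# Hedged sketch: one level of a 2-D discrete wavelet transform, keeping the detail bands
# that a frequency branch would reinforce.
import numpy as np
import pywt

image = np.random.rand(128, 128).astype(np.float32)       # stand-in for a grayscale frame
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")                # approximation + H/V/D detail bands

high_freq = np.stack([cH, cV, cD], axis=0)                 # edge-like high-frequency components
print(cA.shape, high_freq.shape)                           # (64, 64) (3, 64, 64)
```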
Citations: 0
Better early detector for high-performance detection transformer
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-17 | DOI: 10.1016/j.imavis.2025.105829
Bin Hu, Bencheng Liao, Jiyang Qi, Shusheng Yang, Wenyu Liu
Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection - a fundamental task in computer vision and applicable across a range of engineering applications. We found that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention-map-to-feature-map auxiliary loss and a novel local bipartite matching strategy to obtain, at no extra cost, a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone, achieving the best AP-latency trade-off among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates accuracy, speed, and parameters on par with the previous state-of-the-art CNN-based GFLV2 framework for the first time.
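The attention-map-to-feature-map auxiliary loss is not specified in the abstract; the sketch below shows one plausible reading, aligning a normalized attention map with the spatial energy of a feature map via a KL term, purely as an illustration.

```python
# Hedged sketch: align a normalized attention map with normalized feature energy so the
# detector "knows where to focus". Illustrative only, not the paper's loss.
import torch
import torch.nn.functional as F

def attn_feat_aux_loss(attn: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, W) attention map; feat: (B, C, H, W) backbone feature map."""
    feat_energy = feat.pow(2).mean(dim=1)                    # (B, H, W) spatial activation
    attn_n = F.softmax(attn.flatten(1), dim=1)
    feat_n = F.softmax(feat_energy.flatten(1), dim=1)
    return F.kl_div(attn_n.log(), feat_n, reduction="batchmean")

attn, feat = torch.rand(2, 20, 20), torch.randn(2, 256, 20, 20)
print(attn_feat_aux_loss(attn, feat).item())
```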
Citations: 0
MedSetFeat++: An attention-enriched set feature framework for few-shot medical image classification
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-15 | DOI: 10.1016/j.imavis.2025.105825
Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh
Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to control the flow of attention in a dynamic way. Additionally, a Convolutional Block Attention Module (CBAM) based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400× magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.
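The CBAM component mentioned above follows a standard published formulation, sketched below with channel attention from pooled descriptors followed by spatial attention from channel statistics; the channel count and reduction ratio are illustrative.

```python
# Hedged sketch: standard CBAM-style channel + spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                       # avg-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                        # max-pooled channel descriptor
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)         # channel attention
        stats = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(stats))            # spatial attention

print(CBAM(64)(torch.randn(1, 64, 32, 32)).shape)                # torch.Size([1, 64, 32, 32])
```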
Citations: 0
MOT-STM: Maritime Object Tracking: A Spatial-Temporal and Metadata-based approach
IF 4.2 | CAS Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.imavis.2025.105826
Vinayak S. Nageli, Arshad Jamal, Puneet Goyal, Rama Krishna Sai S Gorthi
Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small size of objects from high drone altitudes, sudden movements of the drone’s gimbal and limited appearance diversity of objects. The frequent occlusion in these challenging conditions makes Re-ID difficult in long-term tracking.
In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using a Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via a Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.
We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages the drone’s metadata, such as Global Positioning System (GPS) coordinates, altitude, and camera orientation, to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any State-of-the-Art (SotA) multi-object tracking framework as a Re-ID module.
Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves a state-of-the-art performance attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, showing the effectiveness and robustness of the proposed MOT-STM framework.
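MARe-ID's use of GPS, altitude, and camera orientation suggests gating re-identification by approximate world-space distance. The sketch below projects pixel detections to metric ground offsets under a flat-ground, nadir-camera assumption and gates matches by distance; the field of view, gate threshold, and geometry are illustrative assumptions only.

```python
# Hedged sketch: metadata-based gating for re-identification from drone imagery.
import math

def pixel_to_ground_offset(px, py, img_w, img_h, altitude_m, hfov_deg=80.0):
    """Return the (east, north) offset in meters of a pixel from the image center,
    assuming a straight-down camera over flat ground."""
    ground_width = 2 * altitude_m * math.tan(math.radians(hfov_deg / 2))
    meters_per_px = ground_width / img_w
    return (px - img_w / 2) * meters_per_px, (img_h / 2 - py) * meters_per_px

def same_object(offset_a, offset_b, gate_m=5.0):
    """Gate a re-identification candidate by metric ground distance."""
    return math.dist(offset_a, offset_b) < gate_m

a = pixel_to_ground_offset(640, 360, 1280, 720, altitude_m=50)
b = pixel_to_ground_offset(660, 355, 1280, 720, altitude_m=50)
print(a, b, same_object(a, b))
```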
Citations: 0