Real-time habitat mapping with YOLOv8: A multi-threaded approach to biodiversity preservation
Pub Date: 2025-12-12 DOI: 10.1016/j.cviu.2025.104606
Oluwakemi Akinwehinmi , Alberto Tena , Javier Mora , Francesc Solsona , Pedro Arnau del Amo
This paper presents a robust system for real-time object detection and counting in ecological video streams. It builds on the YOLOv8 architecture, integrated into a multi-threaded video processing pipeline. By parallelizing preprocessing and object detection, the system reduces latency and improves throughput, outperforming traditional single-threaded implementations in continuous video analysis.
The system also incorporates dynamic thresholding methods, fine-tuning, and data augmentation to enhance object detection reliability in dynamic natural environments. These mechanisms improve robustness to changing lighting, occlusions, and background complexity, common challenges in outdoor footage. The system is thoroughly evaluated through performance comparisons between multi-threaded and single-threaded implementations, environmental stress tests, and an ablation study.
Results demonstrate improved consistency in object detection and counting in dynamic environments, along with significant gains in processing speed. Designed for deployment on lightweight and low-power devices, the system is suitable for remote or resource-constrained settings.
While designed for biodiversity monitoring, the approach is applicable to other domains requiring efficient, real-time video analysis in unstructured environments.
{"title":"Real-time habitat mapping with YOLOv8: A multi-threaded approach to biodiversity preservation","authors":"Oluwakemi Akinwehinmi , Alberto Tena , Javier Mora , Francesc Solsona , Pedro Arnau del Amo","doi":"10.1016/j.cviu.2025.104606","DOIUrl":"10.1016/j.cviu.2025.104606","url":null,"abstract":"<div><div>This paper presents a robust system for real-time object detection and counting in ecological video streams. It is based on the YOLOv8 architecture integrated within a multi-threaded video processing architecture. The system reduces latency and improves throughput by parallelizing object detection and preprocessing tasks. This leads to outperforming traditional single-threaded implementations in continuous video analysis.</div><div>The system also incorporates dynamic thresholding methods, fine-tuning, and data augmentation to enhance object detection reliability in dynamic natural environments. These mechanisms improve robustness to changing lighting, occlusions, and background complexity, common challenges in outdoor footage. The system is thoroughly evaluated through performance comparisons between multi-threaded and single-threaded implementations, environmental stress tests, and an ablation study.</div><div>Results demonstrate improved consistency in object detection and counting in dynamic environments, along with significant gains in processing speed. Designed for deployment on lightweight and low-power devices, the system is suitable for remote or resource-constrained settings.</div><div>While designed for biodiversity monitoring, the approach is applicable to other domains requiring efficient, real-time video analysis in unstructured environments.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104606"},"PeriodicalIF":3.5,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distractor suppression Siamese network with task-aware attention for visual tracking
Pub Date: 2025-12-11 DOI: 10.1016/j.cviu.2025.104607
Zhigang Liu , Fuyuan Xing , Hao Huang , Kexin Wang , Yuxuan Shao
Existing IoU-guided trackers suppress background distractors by weighting classification scores with IoU predictions, which limits their effectiveness in complex tracking scenarios. In this paper, we propose a Distractor feature suppression Siamese network with Task-aware attention (SiamDT) for visual tracking. First, we design a distractor feature suppression network that uses IoU scores to suppress distractor features in the classification features, achieving distractor suppression at the feature level. Second, we design a task-aware attention network that reconstructs the cross-correlation features with a hybrid attention mechanism, enhancing the semantic representation of the classification and regression branches across the spatial and channel domains. Extensive experiments on the OTB2013, OTB2015, UAV123, LaSOT, and GOT10k benchmarks demonstrate that SiamDT achieves state-of-the-art tracking performance.
{"title":"Distractor suppression Siamese network with task-aware attention for visual tracking","authors":"Zhigang Liu , Fuyuan Xing , Hao Huang , Kexin Wang , Yuxuan Shao","doi":"10.1016/j.cviu.2025.104607","DOIUrl":"10.1016/j.cviu.2025.104607","url":null,"abstract":"<div><div>Existing IoU-guided trackers suppress background distractors by weighting the classification scores with IoU predictions, which limits their effectiveness in complex tracking scenarios. In this paper, we propose a Distractor feature suppression Siamese network with Task-aware attention (SiamDT) for visual tracking. Firstly, we design a distractor feature suppression network that uses IoU scores to suppress distractor features in the classification feature, achieving distractor suppression at the feature level. At the same time, we design a task-aware attention network that reconstructs the cross-correlation feature by using a hybrid attention mechanism, which enhances the semantic representation capability of the features from the classification and regression branches across spatial and channel domains. Extensive experiments on benchmarks including OTB2013, OTB2015, UAV123, LaSOT, and GOT10k demonstrate that the proposed SiamDT achieves state-of-the-art tracking performance.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104607"},"PeriodicalIF":3.5,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring visual language models for driver gaze estimation: A task-based approach to debugging AI
Pub Date: 2025-12-08 DOI: 10.1016/j.cviu.2025.104593
Paola Natalia Cañas , Alejandro H. Artiles , Marcos Nieto , Igor Rodríguez
Visual Language Models (VLMs) have demonstrated superior context understanding and generalization across various tasks compared to models tailored for specific tasks. However, due to their complexity and the limited information on their training processes, estimating their performance on specific tasks often requires exhaustive testing, which can be costly and may not account for edge cases. To leverage the zero-shot capabilities of VLMs in safety-critical applications like Driver Monitoring Systems, it is crucial to characterize their knowledge and abilities to ensure consistent performance. This research proposes a methodology to explore and better understand how these models function in driver gaze estimation. It involves detailed task decomposition, identification of the necessary data, knowledge, and abilities (e.g., understanding gaze concepts), and exploration through targeted prompting strategies. Applying this methodology to several VLMs (Idefics2, Qwen2-VL, Moondream, GPT-4o) revealed significant limitations, including sensitivity to prompt phrasing, vocabulary mismatches, reliance on image-relative spatial frames, and difficulties inferring non-visible elements. The findings highlight specific areas for improvement and guided the development of more effective prompting and fine-tuning strategies, yielding performance comparable with traditional CNN-based approaches. The methodology is also useful for initial model filtering, for selecting the best model among alternatives, and for understanding a model's limitations and expected behaviors, thereby increasing reliability.
{"title":"Exploring visual language models for driver gaze estimation: A task-based approach to debugging AI","authors":"Paola Natalia Cañas , Alejandro H. Artiles , Marcos Nieto , Igor Rodríguez","doi":"10.1016/j.cviu.2025.104593","DOIUrl":"10.1016/j.cviu.2025.104593","url":null,"abstract":"<div><div>Visual Language Models (VLMs) have demonstrated superior context understanding and generalization across various tasks compared to models tailored for specific tasks. However, due to their complexity and limited information on their training processes, estimating their performance on specific tasks often requires exhaustive testing, which can be costly and may not account for edge cases. To leverage the zero-shot capabilities of VLMs in safety-critical applications like Driver Monitoring Systems, it is crucial to characterize their knowledge and abilities to ensure consistent performance. This research proposes a methodology to explore and gain a deeper understanding of the functioning of these models in driver’s gaze estimation. It involves detailed task decomposition, identification of necessary data knowledge and abilities (e.g., understanding gaze concepts), and exploration through targeted prompting strategies. Applying this methodology to several VLMs (Idefics2, Qwen2-VL, Moondream, GPT-4o) revealed significant limitations, including sensitivity to prompt phrasing, vocabulary mismatches, reliance on image-relative spatial frames, and difficulties inferring non-visible elements. The findings from this evaluation have highlighted specific areas for improvement and guided the development of more effective prompting and fine-tuning strategies, resulting in enhanced performance comparable with traditional CNN-based approaches. This research is also useful for initial model filtering, for selecting the best model among alternatives and for understanding the model’s limitations and expected behaviors, thereby increasing reliability.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104593"},"PeriodicalIF":3.5,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A vision-based framework and dataset for human behavior understanding in industrial assembly lines
Pub Date: 2025-12-06 DOI: 10.1016/j.cviu.2025.104592
Konstantinos Papoutsakis , Nikolaos Bakalos , Athena Zacharia , Konstantinos Fragkoulis , Georgia Kapetadimitri , Maria Pateraki
This paper introduces a vision-based framework and dataset for capturing and understanding human behavior in industrial assembly lines, focusing on car door manufacturing. The framework leverages advanced computer vision techniques to estimate workers' locations and 3D poses and to analyze work postures, actions, and task progress. A key contribution is the CarDA dataset, which contains domain-relevant assembly actions captured in a realistic setting to support the framework's human pose and action analysis. The dataset comprises time-synchronized multi-camera RGB-D videos, motion capture data recorded in a real car manufacturing environment, and annotations for EAWS-based ergonomic risk scores and assembly activities. Experimental results demonstrate the effectiveness of the proposed approach in classifying worker postures and its robust performance in monitoring assembly task progress.
FSATFusion: Frequency-Spatial Attention Transformer for infrared and visible image fusion
Pub Date: 2025-12-05 DOI: 10.1016/j.cviu.2025.104600
Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui, Yuhan Lyu
Infrared and visible image fusion (IVIF) is receiving increasing attention from both the research community and industry due to its excellent results in downstream applications. However, existing deep learning methods exhibit limitations in global feature modeling, in balancing fusion performance with computational efficiency, and in effectively leveraging frequency-domain information. To address these limitations, we propose an end-to-end fusion network named the Frequency-Spatial Attention Transformer Fusion Network (FSATFusion). FSATFusion contains a frequency-spatial attention Transformer (FSAT) module designed to effectively capture discriminative features from the source images. The FSAT module includes a frequency-spatial attention mechanism (FSAM) capable of extracting significant features from feature maps. Additionally, we propose an improved Transformer module (ITM) to enhance the vanilla Transformer's ability to extract global context information without incurring additional computational overhead. Across four public datasets (TNO, MSRS, RoadScene, and RGB–NIR), we conducted extensive qualitative comparisons and quantitative evaluations based on eight metrics against fourteen representative state-of-the-art fusion algorithms. Experimental results demonstrate that the proposed method outperforms state-of-the-art deep learning approaches (e.g., GANMcC, MDA, and EMMA) in qualitative visual quality, objective metrics (e.g., average improvements of approximately 34% in MI, 5% in Qy, and 4% in VIF), and computational efficiency. Furthermore, the fused images generated by our method exhibit superior applicability and performance in downstream object detection tasks. Our code is available at https://github.com/Lmmh058/FSATFusion.
Constructing adaptive spatial-frequency interactive network with bi-directional adapter for generalizable face forgery detection
Pub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104599
Junchang Jing , Yanyan Lv , Ming Li , Dong Liu , Zhiyong Zhang
Although existing face forgery detection methods have demonstrated remarkable performance, they still suffer a significant performance drop when confronted with samples generated by unseen manipulation techniques. This poor generalization arises from detectors overfitting to specific datasets and failing to learn generalizable feature representations. To tackle this problem, we propose a novel adaptive spatial-frequency interactive network with a bi-directional adapter for generalizable face forgery detection. Specifically, we design an Adaptive Region Dynamic Convolution (ARDConv) module and an Adaptive Frequency Dynamic Filter (AFDF) module. The ARDConv module divides the spatial dimension into several regions based on guided features of the input image and employs a multi-head cross-attention mechanism to dynamically generate filters, effectively focusing on subtle texture artifacts in the spatial domain. The AFDF module applies frequency decomposition and dynamic convolution kernels in the frequency domain, adaptively selecting frequency information to capture refined clues. Additionally, we present a dual-domain fusion module based on a Bi-directional Adapter (BAT) to transfer domain-specific feature information from one domain to the other; its advantage is that it enables efficient feature fusion while fine-tuning only a minimal number of BAT parameters. Our method exhibits exceptional generalization in cross-dataset evaluation, improving AUC by 3.07% and 3.15% over the best competing approaches. Moreover, the proposed approach uses only 547K trainable parameters and 130M FLOPs, significantly reducing computational costs compared to other state-of-the-art face forgery detection methods. The code is released at https://github.com/lvyanyana/ASFI.
{"title":"Constructing adaptive spatial-frequency interactive network with bi-directional adapter for generalizable face forgery detection","authors":"Junchang Jing , Yanyan Lv , Ming Li , Dong Liu , Zhiyong Zhang","doi":"10.1016/j.cviu.2025.104599","DOIUrl":"10.1016/j.cviu.2025.104599","url":null,"abstract":"<div><div>Although existing face forgery detection methods have demonstrated remarkable performance, they still suffer a significant performance drop when confronted with samples generated by unseen manipulation techniques. This poor generalization performance arises from the detectors overfitting to specific datasets and failing to learn generalizable feature representations. To tackle this problem, we propose a novel adaptive spatial-frequency interactive network with Bi-directional adapter for generalizable face forgery detection. Specifically, we design an Adaptive Region Dynamic Convolution (ARDConv) module and an Adaptive Frequency Dynamic Filter (AFDF) module. The ARDConv module divides the spatial dimension into several regions based on the guided features of the input image, and employs the multi-head cross-attention mechanism to dynamically generate filters, effectively focusing on subtle texture artifacts in the spatial domain. The AFDF module applies frequency decomposition and dynamic convolution kernels in the frequency domain, which adaptively selecting frequency information to capture refined clues. Additionally, we present a dual-domain fusion module based on Bi-directional Adapter (BAT) to transfer domain-specific feature information from one domain to another. The advantage of this module lies in its ability to enable efficient feature fusion by fine-tuning only minimal BAT parameters. Our method exhibits exceptional generalization capabilities in cross-dataset evaluation, outperforming optimal approaches by 3.07% and 3.15% AUC improvements. Moreover, the proposed approach only utilizes 547K trainable parameters and 130M FLOPs, significantly reducing computational costs compared to other state-of-the-art face forgery detection methods. The code is released at <span><span>https://github.com/lvyanyana/ASFI</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104599"},"PeriodicalIF":3.5,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145736972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LoTeR: Localized text prompt refinement for zero-shot referring image segmentation
Pub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104596
Lei Zhang , Yongqiu Huang , Yingjun Du , Fang Lei , Zhiying Yang , Cees G.M. Snoek , Yehui Wang
This paper addresses the challenge of segmenting an object in an image based solely on a textual description, without requiring any training on specific object classes. In contrast to traditional methods that rely on generating numerous mask proposals, we introduce a novel patch-based approach. Our method computes the similarity between small image patches, extracted using a sliding window, and textual descriptions, producing a patch score map that identifies the regions most likely to contain the target object. This score map guides a segment-anything model to generate precise mask proposals. To further improve segmentation accuracy, we refine the textual prompts by generating detailed object descriptions using a multi-modal large language model. Our method’s effectiveness is validated through extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where it outperforms state-of-the-art zero-shot referring image segmentation methods. Ablation studies confirm the key contributions of our patch-based segmentation and localized text prompt refinement, demonstrating their significant role in enhancing both precision and robustness.
{"title":"LoTeR: Localized text prompt refinement for zero-shot referring image segmentation","authors":"Lei Zhang , Yongqiu Huang , Yingjun Du , Fang Lei , Zhiying Yang , Cees G.M. Snoek , Yehui Wang","doi":"10.1016/j.cviu.2025.104596","DOIUrl":"10.1016/j.cviu.2025.104596","url":null,"abstract":"<div><div>This paper addresses the challenge of segmenting an object in an image based solely on a textual description, without requiring any training on specific object classes. In contrast to traditional methods that rely on generating numerous mask proposals, we introduce a novel patch-based approach. Our method computes the similarity between small image patches, extracted using a sliding window, and textual descriptions, producing a patch score map that identifies the regions most likely to contain the target object. This score map guides a segment-anything model to generate precise mask proposals. To further improve segmentation accuracy, we refine the textual prompts by generating detailed object descriptions using a multi-modal large language model. Our method’s effectiveness is validated through extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where it outperforms state-of-the-art zero-shot referring image segmentation methods. Ablation studies confirm the key contributions of our patch-based segmentation and localized text prompt refinement, demonstrating their significant role in enhancing both precision and robustness.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104596"},"PeriodicalIF":3.5,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A configurable global context reconstruction hybrid detector for enhanced small object detection in UAV aerial imagery
Pub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104598
Hongcheng Xue , Tong Gao , Zhan Tang , Yuantian Xia , Longhe Wang , Lin Li
To address the challenge of balancing detection accuracy and efficiency for small objects in complex aerial scenes, we propose a Configurable Global Context Reconstruction Hybrid Detector (GCRH) to enhance overall detection performance. The GCRH framework consists of three key components. First, the Efficient Re-parameterized Encoder (ERE) reduces the computational overhead of multi-head self-attention through re-parameterization while maintaining the integrity and independence of global–local feature interactions. Second, the Global-Aware Feature Pyramid Network (GAFPN) reconstructs and injects global contextual semantics, cascading selective feature fusion to distribute this semantic information across feature layers, thereby alleviating small-object feature degradation and cross-level semantic inconsistency. Finally, two configurable model variants are provided, allowing control over the high-resolution feature layers to balance detection accuracy and inference efficiency. Experiments on the VisDrone2019 and TinyPerson datasets demonstrate that GCRH achieves an effective trade-off between precision and efficiency, validating its applicability to small object detection in aerial imagery. The code is available at: https://github.com/Mundane-X/GCRH.
{"title":"A configurable global context reconstruction hybrid detector for enhanced small object detection in UAV aerial imagery","authors":"Hongcheng Xue , Tong Gao , Zhan Tang , Yuantian Xia , Longhe Wang , Lin Li","doi":"10.1016/j.cviu.2025.104598","DOIUrl":"10.1016/j.cviu.2025.104598","url":null,"abstract":"<div><div>To address the challenge of balancing detection accuracy and efficiency for small objects in complex aerial scenes, we propose a Configurable Global Context Reconstruction Hybrid Detector (GCRH) to enhance overall detection performance. The GCRH framework consists of three key components. First, the Efficient Re-parameterized Encoder (ERE) reduces the computational overhead of multi-head self-attention through re-parameterization while maintaining the integrity and independence of global–local feature interactions. Second, the Global-Aware Feature Pyramid Network (GAFPN) reconstructs and injects global contextual semantics, cascading selective feature fusion to distribute this semantic information across feature layers, thereby alleviating small-object feature degradation and cross-level semantic inconsistency. Finally, two configurable model variants are provided, allowing the control of high-resolution feature layers to balance detection accuracy and inference efficiency. Experiments on the VisDrone2019 and TinyPerson datasets demonstrate that GCRH achieves an effective trade-off between precision and efficiency, validating its applicability to small object detection in aerial imagery. The code is available at: <span><span>https://github.com/Mundane-X/GCRH</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104598"},"PeriodicalIF":3.5,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145736973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring joint embedding predictive architectures for pretraining convolutional neural networks
Pub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104595
András Kalapos, Bálint Gyires-Tóth
Joint Embedding Predictive Architectures present a novel intermediate approach to visual self-supervised learning (SSL), combining mechanisms from instance discrimination and masked modeling. CNN-JEPA adapts this approach to convolutional neural networks and demonstrates its computational efficiency and accuracy on image classification benchmarks. In this study, we investigate CNN-JEPA, adapt it for semantic segmentation, and propose a learning objective that improves image-level representation learning through a joint embedding predictive architecture. We conduct an extensive evaluation, comparing it with other SSL methods by analyzing data efficiency and computational demands across downstream classification and segmentation benchmarks. Our results show that its classification and segmentation accuracy outperforms similar masked modeling methods such as I-JEPA and SparK with a ResNet-50 or a similarly sized ViT-Small encoder. Furthermore, CNN-JEPA requires fewer computational resources during pretraining, demonstrates excellent data efficiency in data-limited downstream segmentation, and achieves accuracy competitive with successful instance-discrimination-based SSL methods for pretraining encoders on ImageNet.
{"title":"Exploring joint embedding predictive architectures for pretraining convolutional neural networks","authors":"András Kalapos, Bálint Gyires-Tóth","doi":"10.1016/j.cviu.2025.104595","DOIUrl":"10.1016/j.cviu.2025.104595","url":null,"abstract":"<div><div>Joint Embedding Predictive Architectures present a novel intermediate approach to visual self-supervised learning combining mechanisms from instance discrimination and masked modeling. CNN-JEPA adapts this approach to convolutional neural networks and demonstrates its computational efficiency and accuracy on image classification benchmarks. In this study, we investigate CNN-JEPA, adapt it for semantic segmentation, and propose a learning objective that improves image-level representation learning through a joint embedding predictive architecture. We conduct an extensive evaluation, comparing it with other SSL methods by analyzing data efficiency and computational demands across downstream classification and segmentation benchmarks. Our results show that its classification and segmentation accuracy outperforms similar masked modeling methods such as I-JEPA and SparK with a ResNet-50 or a similarly sized ViT-Small encoder. Furthermore, CNN-JEPA requires fewer computational resources during pretraining, demonstrates excellent data efficiency in data-limited downstream segmentation, and achieves competitive accuracy with successful instance discrimination-based SSL methods for pretraining encoders on ImageNet.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104595"},"PeriodicalIF":3.5,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145736966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An end-to-end pipeline for team-aware, pose-aligned augmented reality in cycling broadcasts
Pub Date: 2025-12-04 DOI: 10.1016/j.cviu.2025.104602
Winter Clinckemaillie, Jelle Vanhaeverbeke, Maarten Slembrouck, Steven Verstockt
Advanced computer vision and machine learning technologies are transforming how we experience sports events. This work enriches helicopter footage of cycling races with dynamic, in-scene, pose-aligned augmented reality (AR) overlays (e.g., rider name, speed, wind direction) that remain visually attached to each rider. To achieve this, we propose a multi-stage pipeline: cyclists are first detected and tracked, followed by team recognition using a one-shot learning approach based on Siamese neural networks, which achieves a classification accuracy of 85% on a test set of teams unseen during training. This design allows easy adaptation and reuse across different races and seasons, accommodating frequent jersey and team changes with minimal effort. We introduce a pose-based AR overlay that anchors rider labels to moving cyclists without fixed field landmarks or homography, enabling dynamic overlays in unconstrained cycling broadcasts. Real-time feasibility is demonstrated through runtime profiling and TensorRT optimizations. Finally, a user study evaluates the readability, informativeness, visual stability, and engagement of our AR-enhanced broadcasts. The combination of advanced computer vision, AR, and user-centered evaluation showcases new possibilities for improving live sports broadcasts, particularly in challenging environments like road cycling.
{"title":"An end-to-end pipeline for team-aware, pose-aligned augmented reality in cycling broadcasts","authors":"Winter Clinckemaillie, Jelle Vanhaeverbeke, Maarten Slembrouck, Steven Verstockt","doi":"10.1016/j.cviu.2025.104602","DOIUrl":"10.1016/j.cviu.2025.104602","url":null,"abstract":"<div><div>Advanced computer vision and machine learning technologies transform how we experience sports events. This work enriches helicopter footage of cycling races with dynamic, in-scene, pose-aligned augmented reality (AR) overlays (e.g., rider name, speed, wind direction) that remain visually attached to each rider. To achieve this, we propose a multi-stage pipeline: cyclists are first detected and tracked, followed by team recognition using a one-shot learning approach based on Siamese neural networks, which achieves a classification accuracy of 85% on a test set composed of unseen teams during training. This design allows easy adaptation and reuse across different races and seasons, enabling frequent jersey and team changes with minimal effort. We introduce a pose-based AR overlay that anchors rider labels to moving cyclists without fixed field landmarks or homography, enabling dynamic overlays in unconstrained cycling broadcasts. Real-time feasibility is demonstrated through runtime profiling and TensorRT optimizations. Finally, a user study evaluates the readability, informativeness, visual stability, and engagement of our AR-enhanced broadcasts. The combination of advanced computer vision, AR, and user-centered evaluation showcases new possibilities for improving live sports broadcasts, particularly in challenging environments like road cycling.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104602"},"PeriodicalIF":3.5,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145736969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}