Pub Date : 2026-01-12 DOI: 10.1016/j.cviu.2026.104648
Hang Su , Hong-Bo Zhang , Jia-Yin Luo , Jing-Hua Liu , Zhen-Zhen Sun , Ji-Xiang Du
Transformer-based methods have shown strong potential in human-object interaction (HOI) detection, yet challenges remain due to task interference and unstable query initialization. To overcome these limitations, the paper adopts a cascaded-parallel decoder architecture that balances the efficiency of one-stage models with the task-decoupling advantages of two-stage designs. A key component is an anchor-guided query generator, which explicitly incorporates human-object spatial information into query initialization. This gives queries strong spatial awareness, stabilizes training, and significantly improves human and object localization, a crucial prerequisite for accurate HOI detection. In addition, interaction features are further refined by modeling multi-relational cues within each triplet, facilitating more reliable verb classification. Extensive experiments on the HICO-DET and V-COCO datasets demonstrate that the proposed method achieves superior performance compared to state-of-the-art approaches.
Title: Cascaded-parallel decoders and anchor-guided query generator for human-object interaction. Journal: Computer Vision and Image Understanding, vol. 264, Article 104648.
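The abstract above says the anchor-guided query generator injects human-object spatial information into query initialization but does not give the construction. As a rough illustration only, the sketch below uses a common technique from anchor-based detection transformers: sinusoidal encoding of normalized box coordinates. The function names, the `(cx, cy, w, h)` box format, and the embedding size are assumptions for this example, not details from the paper.

```python
import math


def sinusoidal_embed(value, dim=8, temperature=10000.0):
    """Encode one normalized scalar in [0, 1] as `dim` sin/cos features."""
    feats = []
    for i in range(dim // 2):
        # Each sin/cos pair uses a progressively lower frequency.
        freq = temperature ** (2 * i / dim)
        angle = 2 * math.pi * value / freq
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats


def anchor_to_query(human_box, object_box, dim=8):
    """Build an initial query vector from a human-object anchor pair.

    Boxes are hypothetical (cx, cy, w, h) tuples normalized to [0, 1].
    The eight coordinates are embedded independently and concatenated.
    """
    query = []
    for coord in (*human_box, *object_box):
        query.extend(sinusoidal_embed(coord, dim))
    return query


if __name__ == "__main__":
    q = anchor_to_query((0.40, 0.50, 0.20, 0.60), (0.55, 0.60, 0.10, 0.10))
    print(len(q))  # 8 coordinates x 8 features each
```

In a real model a learned projection would typically follow the concatenation; it is omitted here to keep the sketch dependency-free.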
Pub Date : 2026-01-12 DOI: 10.1016/j.cviu.2026.104657
Hao Xu , Arbind Agrahari Baniya , Sam Wells , Mohamed Reda Bouadjenek , Richard Dazeley , Sunil Aryal
Ball tracking is a fundamental problem in computer vision, particularly in sports analytics, where it underpins tasks such as analyzing ball movement in soccer and basketball or detecting bounce locations in tennis and table tennis. Most existing methods are developed and evaluated on resource-rich, commercial sports footage with ideal camera angles, high-resolution imagery, and multiple viewpoints. In contrast, many other sports contexts, including semi-professional leagues, local amateur competitions, and Paralympic sports, lack these resources. Footage in these settings often comes from single, fixed, and suboptimal viewpoints, where occlusion becomes a dominant challenge for automated tracking. Existing methods frequently fall short in such conditions because their architectures and training strategies do not explicitly account for prolonged or full occlusion. To address this gap, we present the Table Tennis Australia (TTA) dataset, the first professionally annotated Paralympic table tennis benchmark with dense visibility labels, captured under realistic single-view conditions. With 2,396 occluded instances (including 998 fully occluded), TTA is the most occlusion-rich publicly available dataset to date. Alongside the dataset, we propose the Temporal Occlusion Tracking Network (TOTNet), a novel tracking system designed to maintain localization accuracy even under extended occlusion. In comprehensive experiments on four sports tracking datasets, TOTNet achieves state-of-the-art performance, with substantial gains in full-occlusion scenarios. We release the dataset, code, and evaluation scripts to foster reproducibility and future research in occlusion-robust tracking for low-resource sports; all materials are available at https://github.com/AugustRushG/TOTNet.
Title: TOTNet: Occlusion-aware temporal tracking for robust ball detection in sports videos. Journal: Computer Vision and Image Understanding, vol. 264, Article 104657.
Pub Date : 2026-01-12 DOI: 10.1016/j.cviu.2026.104656
Junhao Sun, Lanfei Zhao
Action Quality Assessment (AQA) aims to quantitatively evaluate the execution quality of complex human actions, which poses significant challenges due to the need for jointly modeling spatio-temporal dynamics and semantic structures. Existing approaches typically rely on static single-branch architectures, limiting their capacity to balance local fine-grained details and global rhythmic dependencies, especially in high-complexity scenarios. To address these limitations, we propose a novel Spatio-Temporal Adaptive Recalibration (STAR) Block, which enables highly discriminative representation learning via a multi-dimensional modeling strategy. Specifically, we first design a Multi-Scale Context Encoder to capture subtle local cues by leveraging parallel convolutions across spatial, temporal, and joint domains, enhancing the perception of motion details and short-term dynamics. Second, we introduce an Axial Attention-Based Global Dependency Modeling Module, which efficiently captures long-range temporal relationships while preserving the original spatio-temporal structure, thus reinforcing the understanding of phase coherence and motion rhythm. Third, a Dynamic Attention-Guided Adaptive Feature Fusion mechanism is proposed to integrate multi-path temporal semantics by assigning adaptive weights to local and global representations, enabling dynamic equilibrium in temporal modeling. In extensive evaluations across multiple metrics, the STAR Block outperforms state-of-the-art methods by significant margins, achieving an average Spearman’s ρ improvement of 1.56% on AQA-7 and 0.57% on MTL-AQA with DD supervision, and a near-perfect 99.52% accuracy on FR-FS.
Title: STAR Block: Adaptive spatio-temporal recalibration for action quality assessment. Journal: Computer Vision and Image Understanding, vol. 264, Article 104656.
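The AQA-7 and MTL-AQA results above are reported as Spearman's rank correlation between predicted and ground-truth quality scores. For readers unfamiliar with the metric, here is a minimal, dependency-free sketch that computes it as the Pearson correlation of average ranks (which handles ties). The function names and sample scores are made up for illustration; this is not the authors' evaluation code.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(predicted, target):
    """Spearman's rho: Pearson correlation computed on the rank vectors.

    Assumes both inputs have the same length and neither is constant
    (a constant input would make the denominator zero).
    """
    rp, rt = average_ranks(predicted), average_ranks(target)
    n = len(rp)
    mean_p, mean_t = sum(rp) / n, sum(rt) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(rp, rt))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_t = sum((b - mean_t) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5


if __name__ == "__main__":
    # Hypothetical predicted vs. judged scores for four routines.
    print(spearman_rho([80.5, 62.0, 91.3, 70.1], [78.0, 65.0, 95.0, 60.0]))
```

Because only ranks matter, the metric rewards getting the ordering of performances right rather than matching the absolute score values.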
Pub Date : 2026-01-10 DOI: 10.1016/j.cviu.2026.104650
David Tejero-Ruiz , David Solís-Martín , Francisco J. Pérez-Grau , Joaquín Borrego-Díaz
Indoor UAV navigation faces significant challenges due to GPS signal absence and limitations of conventional visual-inertial systems under challenging lighting and motion conditions. This paper presents an event-based visual-inertial odometry system that addresses these limitations through intermediate frame reconstruction from event streams combined with established odometry algorithms. The approach leverages event cameras’ unique characteristics — microsecond temporal resolution, high dynamic range (120 dB), and motion blur immunity — to maintain stable navigation performance under conditions that cause conventional systems to fail. The system achieves real-time operation at 30 Hz frame reconstruction and 20 Hz pose estimation on embedded hardware, consuming 15 W power while adding only 50 g to the UAV platform. Experimental validation in controlled indoor environments demonstrates mean absolute pose errors of 26–42 cm across different operational conditions, comparable to conventional visual-inertial systems. Critically, the system maintains stable performance during rapid lighting transitions, showing only 59% performance degradation compared to baseline conditions, while conventional cameras typically experience complete tracking failure. The results establish event-based visual-inertial odometry as a viable alternative for indoor UAV navigation, particularly in applications requiring environmental robustness over marginal accuracy improvements under optimal conditions.
Title: Indoor UAV navigation using event cameras and intermediate frame reconstruction. Journal: Computer Vision and Image Understanding, vol. 264, Article 104650.
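The system above reconstructs intermediate frames from the event stream before handing them to a conventional odometry pipeline. The abstract does not specify the reconstruction method, so the sketch below shows only the simplest possible stand-in: summing event polarities per pixel over a fixed time window (about 33,333 µs for the reported 30 Hz frame rate). The `(timestamp_us, x, y, polarity)` event layout and the function name are assumptions for this example.

```python
def accumulate_events(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over the window [t_start, t_end).

    `events` is an iterable of (timestamp_us, x, y, polarity) tuples, where
    polarity is 1 for a brightness increase and 0 for a decrease. This is a
    naive accumulation image, not a learned or filtered reconstruction.
    """
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        if t_start <= t < t_end:
            frame[y][x] += 1 if polarity else -1
    return frame


if __name__ == "__main__":
    window_us = 1_000_000 // 30  # one frame interval at 30 Hz
    events = [
        (100, 0, 0, 1),
        (200, 0, 0, 1),
        (300, 1, 1, 0),
        (40_000, 1, 0, 1),  # falls in the next window, so it is skipped
    ]
    for row in accumulate_events(events, 2, 2, 0, window_us):
        print(row)
```

A practical pipeline would normalize the accumulated image to an intensity range before feature tracking; that step is left out here.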
Pub Date : 2026-01-07 DOI: 10.1016/j.cviu.2026.104637
Yiming Yang, Feng Guo, Pei Niu
Real-time object detection is pivotal in traffic-related Unmanned Aerial Vehicle (UAV) applications. However, UAV imagery presents significant challenges due to the predominance of small objects and complex backgrounds. Traditional backbones generally perform aggressive early-stage downsampling, causing the loss of fine-grained features. To address these issues, we propose UAVDet, a real-time detection model that combines Convolutional Neural Network (CNN) and Mamba architectures. First, we revisit the conventional backbone design by reconfiguring its depth and width, with a focus on preserving fine-grained details crucial for small object detection. Second, we propose the Cross Stage Partial Mamba (CSPMB) module, which integrates the Mamba structure into the CNN framework to enhance global feature representation and improve robustness against complex background interference. Third, we design a Tiny-focused Feature Pyramid Network (TFPN) by rebalancing the feature fusion flow and replacing the large-object detection head with a tiny-object detection head, which significantly improves the perception of small objects. Comprehensive experiments on the VisDrone dataset show that our method improves AP and AP_S by 4.5% and 5.0%, respectively, while reducing parameters by 84.9% compared to the baseline. It also reaches 53 FPS on an RTX 4090, exceeding the 30 FPS real-time threshold. Additional evaluations on UAVDT and DroneVehicle further verify the method’s robust generalization. These results indicate the effectiveness of the developed method in UAV image detection.
Title: UAVDet: A CNN–Mamba hybrid network for efficient small object detection in UAV imagery. Journal: Computer Vision and Image Understanding, vol. 264, Article 104637.
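To see why swapping a large-object head for a tiny-object head can help, consider how many feature-map cells a small object covers at each detection stride. The arithmetic sketch below assumes a 640-pixel input, a conventional head layout with strides 8, 16, and 32, and a tiny-focused layout with strides 4, 8, and 16. These numbers are illustrative assumptions; the abstract does not state the strides UAVDet uses.

```python
def head_grid_sizes(input_size, strides):
    """Map each detection stride to the side length of its feature grid."""
    return {stride: input_size // stride for stride in strides}


def cells_spanned(object_px, stride):
    """Number of grid cells (per side) that an object of this size covers."""
    return object_px / stride


if __name__ == "__main__":
    conventional = (8, 16, 32)   # assumed baseline head strides
    tiny_focused = (4, 8, 16)    # assumed layout after the head swap

    print(head_grid_sizes(640, conventional))
    print(head_grid_sizes(640, tiny_focused))

    # A hypothetical 12-pixel vehicle seen from a UAV:
    for stride in (4, 32):
        print(stride, cells_spanned(12, stride))
```

Under these assumptions a 12-pixel object covers three cells per side at stride 4 but well under one cell at stride 32, which is the intuition behind dropping the coarsest head when most targets are small.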
Pub Date : 2026-01-07 DOI: 10.1016/j.cviu.2026.104636
Weiwei Duan, Luping Ji, Shengjia Chen, Jianghong Huang
Due to the low signal-to-noise ratio and weak visual contrast, infrared small targets are often submerged in the background. Therefore, it is crucial to preserve target information while extracting distinctive features that distinguish them from the background. However, existing methods generally rely on convolutions and transformers in isolation, which limits their ability to capture robust target features in complex scenes. To address this issue, we propose a new local–global feature collaborative learning (LGFC) framework that integrates the local spatial features with the global context of targets in a unified manner. Specifically, we develop an enhanced Gaussian-mask Vision Transformer group with Global Gaussian Attention and Local Window Attention to extract refined global features. The local coarse features obtained from the convolution encoder are then coordinated with the refined global features through Local–Global Collaborating. Moreover, we propose a level-wise decoding strategy with Cross-layer Feature Interaction to mitigate information loss during decoding in deep networks. Additionally, we introduce a Coarse-to-Fine Refinement post-processing mechanism to improve the precision of target contours. Extensive experiments on three public datasets (NUAA-SIRST, IRSTD-1K and SIRST-AUG) demonstrate the superiority and generalization ability of the proposed LGFC framework for infrared small target detection, which outperforms state-of-the-art methods by approximately 2.3% in F1-score on each dataset.
Title: Local–global collaborative feature learning with level-wise decoding for infrared small target detection. Journal: Computer Vision and Image Understanding, vol. 264, Article 104636.
Pub Date : 2026-01-07 DOI: 10.1016/j.cviu.2026.104638
Zhuo-Ming Du , Hong-An Li , Qian Yu , Wen-He Chen , Fei-long Han
Accurate estimation and correction of global illuminant color, known as color constancy, is crucial for computational photography and computer vision but remains challenging under complex lighting conditions. We propose CSNet, an end-to-end framework that improves color constancy through a novel content-guided feature fusion approach. The input image is first decomposed into three precomputed components: mean intensity, variation magnitude, and variation direction. These components are dynamically reweighted by the Content-Weighting Network (CWN), which generates spatially varying weight maps by leveraging both local and global image features. The reweighted components are fused via the Adaptive Fusion Module (AFM) to produce an HDR-like intermediate representation. This representation is then processed by the Illumination Prediction Network (IPN), which applies semantic-aware weighting to estimate the global illuminant color as an RGB triplet. Extensive experiments on standard benchmarks demonstrate that CSNet achieves state-of-the-art performance, offering robust and visually consistent results under diverse lighting conditions. These advantages make CSNet a powerful tool for applications such as automatic photo correction and augmented reality.
Title: CSNet: A content and structure-aware approach for color constancy. Journal: Computer Vision and Image Understanding, vol. 264, Article 104638.
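CSNet's final output is a single RGB triplet for the global illuminant. Two standard operations show how such an estimate is typically used and scored: a von Kries-style diagonal correction that rescales each channel, and the angular error between the estimated and ground-truth illuminant vectors. The sketch below implements both with the standard library. It covers only these generic pieces, not CSNet's networks, and the function names and sample values are assumptions for illustration.

```python
import math


def angular_error_deg(estimate, ground_truth):
    """Angle in degrees between two illuminant RGB vectors.

    The measure ignores vector length, so only chromaticity matters.
    """
    dot = sum(e * g for e, g in zip(estimate, ground_truth))
    norm_e = math.sqrt(sum(e * e for e in estimate))
    norm_g = math.sqrt(sum(g * g for g in ground_truth))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_e * norm_g)))
    return math.degrees(math.acos(cos_angle))


def correct_pixel(pixel, illuminant):
    """Diagonal (von Kries-style) correction of one RGB pixel.

    Each channel is divided by the illuminant and rescaled by the illuminant
    mean, so a pixel with the illuminant's own color becomes neutral gray.
    """
    mean = sum(illuminant) / 3
    return tuple(p * mean / c for p, c in zip(pixel, illuminant))


if __name__ == "__main__":
    estimated = (0.8, 0.5, 0.2)      # hypothetical warm illuminant estimate
    ground_truth = (0.75, 0.55, 0.2)
    print(angular_error_deg(estimated, ground_truth))
    print(correct_pixel((0.8, 0.5, 0.2), estimated))
```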
Pub Date : 2026-01-07 DOI: 10.1016/j.cviu.2026.104635
Bowen Xu , Yaru Sui , Longxin Liu, Zhenlong Ma, Yunlong Shi, Wentong Li, Xiaoqiang Ji
Face liveness detection algorithms are widely used in anti-spoofing applications, where they safeguard the accuracy and security of face recognition systems. However, with the continuous development of technologies such as 3D printing and artificial intelligence, traditional face liveness detection algorithms struggle to withstand spoofing attacks effectively. In this paper, we propose a multi-feature fusion algorithm that uses only facial video for face liveness detection. First, we design a dual-channel network named DC-Net. It extracts robust remote photoplethysmography signals directly from 5-second facial videos, as well as fine global texture features from the keyframes of the image sequence. Next, a fusion module based on the attention mechanism carries out feature-level fusion. Finally, a fully connected layer performs binary classification. Our methodology was validated on the REPLAY-ATTACK and 3DMAD datasets; across printing attacks, screen replay attacks, and 3D mask attacks, it attained accuracies of 99.79% and 100% on the two datasets, respectively. Cross-dataset testing on the CASIA-FASD and HKBU-MARs V1+ datasets yielded half total error rates (HTER) of 25.56% and 0.00%, respectively. These results indicate that the algorithm is accurate and robust against spoofing attacks in many different scenarios, providing useful ideas and technical support for the design and implementation of reliable face recognition systems.
Title: A dual-channel model based on multi-feature fusion for face liveness detection. Journal: Computer Vision and Image Understanding, vol. 264, Article 104635.
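The cross-dataset results above are reported as HTER, the half total error rate, which is the mean of the false acceptance rate (spoofs accepted as live) and the false rejection rate (live faces rejected). A minimal sketch of the computation follows. The score convention (higher means more likely live), the threshold, and the sample scores are assumptions for illustration, not values from the paper.

```python
def hter(scores_live, scores_spoof, threshold):
    """Half total error rate at a fixed decision threshold.

    A sample is accepted as live when its score is >= threshold.
    FRR is the fraction of live samples rejected; FAR is the fraction of
    spoof samples accepted. HTER is their unweighted mean.
    """
    frr = sum(s < threshold for s in scores_live) / len(scores_live)
    far = sum(s >= threshold for s in scores_spoof) / len(scores_spoof)
    return (far + frr) / 2


if __name__ == "__main__":
    live = [0.9, 0.8, 0.4]   # hypothetical liveness scores for real faces
    spoof = [0.1, 0.6]       # hypothetical scores for attack samples
    print(hter(live, spoof, threshold=0.5))
```

In cross-dataset protocols the threshold is usually fixed on a development set from the training dataset, which is why HTER on an unseen dataset can be much higher than in-dataset error.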
Pub Date : 2026-01-07 DOI: 10.1016/j.cviu.2025.104625
Yang Zhang , Tao Qin , Yimin Zhou
Ocean scenes are usually intricate and complex, with low signal-to-noise ratios for tiny or distant objects and susceptibility to interference from the underwater background and lighting conditions, which makes general object detection methods difficult to apply directly. To solve these problems, a YOLO-OCEAN method is proposed with YOLOv5 as the baseline model. An ultra-small-scale feature layer, multi-branch feature enhancement with cross-scale fusion, a visual-transformer bridge, a CSP-connected SPPF block and dynamic activation are incorporated into the backbone and neck to improve detection performance. A More Efficient Intersection over Union regression loss function is applied to the detection head. Moreover, the model is re-parameterized and made lightweight to increase detection speed. In comparison experiments against other object detection baseline models, the proposed YOLO-OC method achieves 86.6% [email protected] and a 5.1 ms inference time, demonstrating real-time, high-accuracy detection of small objects in ocean scenes.
{"title":"A YOLO-OC real-time small object detection in ocean scenes","authors":"Yang Zhang , Tao Qin , Yimin Zhou","doi":"10.1016/j.cviu.2025.104625","DOIUrl":"10.1016/j.cviu.2025.104625","url":null,"abstract":"<div><div>Ocean scenes are usually intricate and complex, with low signal-to-noise ratios for tiny or distant objects and susceptibility to interference from the underwater background and lighting conditions, which makes general object detection methods difficult to apply directly. To solve these problems, a YOLO-OCEAN method is proposed with YOLOv5 as the baseline model. An ultra-small-scale feature layer, multi-branch feature enhancement with cross-scale fusion, a visual-transformer bridge, a CSP-connected SPPF block and dynamic activation are incorporated into the backbone and neck to improve detection performance. A More Efficient Intersection over Union regression loss function is applied to the detection head. Moreover, the model is re-parameterized and made lightweight to increase detection speed. In comparison experiments against other object detection baseline models, the proposed YOLO-OC method achieves 86.6% [email protected] and a 5.1 ms inference time, demonstrating real-time, high-accuracy detection of small objects in ocean scenes.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104625"},"PeriodicalIF":3.5,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-06 DOI: 10.1016/j.cviu.2025.104629
Ofer Idan, Yoli Shavit, Yosi Keller
Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our code and pre-trained models are available at https://github.com/yolish/relformer.
{"title":"Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments","authors":"Ofer Idan, Yoli Shavit, Yosi Keller","doi":"10.1016/j.cviu.2025.104629","DOIUrl":"10.1016/j.cviu.2025.104629","url":null,"abstract":"<div><div>Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. 
Our code and pre-trained models are available at https://github.com/yolish/relformer.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104629"},"PeriodicalIF":3.5,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}