CSNet: A content and structure-aware approach for color constancy
Pub Date: 2026-02-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.cviu.2026.104638
Zhuo-Ming Du, Hong-An Li, Qian Yu, Wen-He Chen, Fei-long Han
Accurate estimation and correction of global illuminant color, known as color constancy, is crucial for computational photography and computer vision but remains challenging under complex lighting conditions. We propose CSNet, an end-to-end framework that improves color constancy through a novel content-guided feature fusion approach. The input image is first decomposed into three precomputed components: mean intensity, variation magnitude, and variation direction. These components are dynamically reweighted by the Content-Weighting Network (CWN), which generates spatially varying weight maps by leveraging both local and global image features. The reweighted components are fused via the Adaptive Fusion Module (AFM) to produce an HDR-like intermediate representation. This representation is then processed by the Illumination Prediction Network (IPN), which applies semantic-aware weighting to estimate the global illuminant color as an RGB triplet. Extensive experiments on standard benchmarks demonstrate that CSNet achieves state-of-the-art performance, offering robust and visually consistent results under diverse lighting conditions. These advantages make CSNet a powerful tool for applications such as automatic photo correction and augmented reality.
{"title":"CSNet: A content and structure-aware approach for color constancy","authors":"Zhuo-Ming Du , Hong-An Li , Qian Yu , Wen-He Chen , Fei-long Han","doi":"10.1016/j.cviu.2026.104638","DOIUrl":"10.1016/j.cviu.2026.104638","url":null,"abstract":"<div><div>Accurate estimation and correction of global illuminant color, known as color constancy, is crucial for computational photography and computer vision but remains challenging under complex lighting conditions. We propose CSNet, an end-to-end framework that improves color constancy through a novel content-guided feature fusion approach. The input image is first decomposed into three precomputed components: mean intensity, variation magnitude, and variation direction. These components are dynamically reweighted by the Content-Weighting Network (CWN), which generates spatially varying weight maps by leveraging both local and global image features. The reweighted components are fused via the Adaptive Fusion Module (AFM) to produce an HDR-like intermediate representation. This representation is then processed by the Illumination Prediction Network (IPN), which applies semantic-aware weighting to estimate the global illuminant color as an RGB triplet. Extensive experiments on standard benchmarks demonstrate that CSNet achieves state-of-the-art performance, offering robust and visually consistent results under diverse lighting conditions. These advantages make CSNet a powerful tool for applications such as automatic photo correction and augmented reality.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104638"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STAR Block: Adaptive spatio-temporal recalibration for action quality assessment
Pub Date: 2026-02-01 | Epub Date: 2026-01-12 | DOI: 10.1016/j.cviu.2026.104656
Junhao Sun, Lanfei Zhao
Action Quality Assessment (AQA) aims to quantitatively evaluate the execution quality of complex human actions, which poses significant challenges due to the need for jointly modeling spatio-temporal dynamics and semantic structures. Existing approaches typically rely on static single-branch architectures, limiting their capacity to balance local fine-grained details and global rhythmic dependencies, especially in high-complexity scenarios. To address these limitations, we propose a novel Spatio-Temporal Adaptive Recalibration (STAR) Block, which enables highly discriminative representation learning via a multi-dimensional modeling strategy. Specifically, we first design a Multi-Scale Context Encoder to capture subtle local cues by leveraging parallel convolutions across spatial, temporal, and joint domains, enhancing the perception of motion details and short-term dynamics. Second, we introduce an Axial Attention-Based Global Dependency Modeling Module, which efficiently captures long-range temporal relationships while preserving the original spatio-temporal structure, thus reinforcing the understanding of phase coherence and motion rhythm. Third, a Dynamic Attention-Guided Adaptive Feature Fusion mechanism is proposed to integrate multi-path temporal semantics by assigning adaptive weights to local and global representations, enabling dynamic equilibrium in temporal modeling. Across multiple metrics, our STAR Block delivers remarkably superior performance with significant margins over state-of-the-art methods, achieving an average Spearman’s ρ improvement of 1.56% on AQA-7, 0.57% on MTL-AQA with DD supervision, and near-perfect 99.52% accuracy on FR-FS, as proven by extensive evaluations.
{"title":"STAR Block: Adaptive spatio-temporal recalibration for action quality assessment","authors":"Junhao Sun, Lanfei Zhao","doi":"10.1016/j.cviu.2026.104656","DOIUrl":"10.1016/j.cviu.2026.104656","url":null,"abstract":"<div><div>Action Quality Assessment (AQA) aims to quantitatively evaluate the execution quality of complex human actions, which poses significant challenges due to the need for jointly modeling spatio-temporal dynamics and semantic structures. Existing approaches typically rely on static single-branch architectures, limiting their capacity to balance local fine-grained details and global rhythmic dependencies, especially in high-complexity scenarios. To address these limitations, we propose a novel Spatio-Temporal Adaptive Recalibration (STAR) Block, which enables highly discriminative representation learning via a multi-dimensional modeling strategy. Specifically, we first design a Multi-Scale Context Encoder to capture subtle local cues by leveraging parallel convolutions across spatial, temporal, and joint domains, enhancing the perception of motion details and short-term dynamics. Second, we introduce an Axial Attention-Based Global Dependency Modeling Module, which efficiently captures long-range temporal relationships while preserving the original spatio-temporal structure, thus reinforcing the understanding of phase coherence and motion rhythm. Third, a Dynamic Attention-Guided Adaptive Feature Fusion mechanism is proposed to integrate multi-path temporal semantics by assigning adaptive weights to local and global representations, enabling dynamic equilibrium in temporal modeling. Across multiple metrics, our STAR Block delivers remarkably superior performance with significant margins over state-of-the-art methods, achieving an average Spearman’s <span><math><mi>ρ</mi></math></span> improvement of 1.56% on AQA-7, 0.57% on MTL-AQA with DD supervision, and near-perfect 99.52% accuracy on FR-FS, as proven by extensive evaluations.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104656"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Boundary-aware semantic segmentation for ice hockey rink registration
Pub Date: 2026-02-01 | Epub Date: 2025-12-27 | DOI: 10.1016/j.cviu.2025.104627
Zhibo Wang, Amir Nazemi, Stephie Liu, Sirisha Rambhatla, Yuhao Chen, David Clausi
Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of +2.84 and +3.48 in IoU_part and IoU_whole on the NHL dataset, and +1.53 and +5.85 on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.
{"title":"Boundary-aware semantic segmentation for ice hockey rink registration","authors":"Zhibo Wang , Amir Nazemi , Stephie Liu , Sirisha Rambhatla , Yuhao Chen , David Clausi","doi":"10.1016/j.cviu.2025.104627","DOIUrl":"10.1016/j.cviu.2025.104627","url":null,"abstract":"<div><div>Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of <span><math><mrow><mo>+</mo><mn>2</mn><mo>.</mo><mn>84</mn></mrow></math></span> and <span><math><mrow><mo>+</mo><mn>3</mn><mo>.</mo><mn>48</mn></mrow></math></span> in IoU<sub>part</sub> and IoU<sub>whole</sub> on the NHL dataset, and <span><math><mrow><mo>+</mo><mn>1</mn><mo>.</mo><mn>53</mn></mrow></math></span> and <span><math><mrow><mo>+</mo><mn>5</mn><mo>.</mo><mn>85</mn></mrow></math></span> on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104627"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D-aware virtual try-on using only 2D inputs
Pub Date: 2026-02-01 | Epub Date: 2026-01-21 | DOI: 10.1016/j.cviu.2026.104661
Jaeyoon Lee, Hojoon Jung, Jongwon Choi
We present 3DFit, a novel 3D-aware virtual try-on framework that synthesizes realistic try-on images using only 2D inputs. Unlike previous methods that either ignore 3D body geometry or rely entirely on 3D clothing models, 3DFit utilizes 3D human meshes estimated from 2D images and adaptively transforms 3D clothing templates guided by 2D clothing images. We further introduce a warping strategy that integrates 3D information into 2D clothing images using a set of pre-designed 3D templates, enabling efficient adaptation to various body shapes and poses. As a result, our method supports accurate and personalized virtual try-on experiences. Experimental results on the VITON-HD dataset demonstrate that 3DFit outperforms existing methods in preserving garment structure and maintaining high visual quality across a wide range of body types and poses.
{"title":"3D-aware virtual try-on using only 2D inputs","authors":"Jaeyoon Lee , Hojoon Jung , Jongwon Choi","doi":"10.1016/j.cviu.2026.104661","DOIUrl":"10.1016/j.cviu.2026.104661","url":null,"abstract":"<div><div>We present 3DFit, which is a novel 3D-aware virtual try-on framework that synthesizes realistic try-on images using only 2D inputs. Unlike previous methods that either ignore 3D body geometry or rely entirely on 3D clothing models, 3DFit utilizes 3D human meshes estimated from 2D images and adaptively transforms 3D clothing templates guided by 2D clothing images. We further introduce a warping strategy that integrates 3D information into 2D clothing images using a set of pre-designed 3D templates, enabling efficient adaptation to various body shapes and poses. As a result, our method supports accurate and personalized virtual try-on experiences. Experimental results on the VITON-HD dataset demonstrate that 3DFit outperforms existing methods in preserving garment structure and maintaining high visual quality across a wide range of body types and poses.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104661"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146078309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interaction-aware representation learning for action quality assessment in freestyle skiing big air
Pub Date: 2026-02-01 | Epub Date: 2026-01-21 | DOI: 10.1016/j.cviu.2026.104634
Shiyue Chen, Yanchao Liu, Ziyue Wang, Xina Cheng, Takeshi Ikenaga
Freestyle skiing big air requires precise athlete–ski coordination to determine both technical difficulty and execution quality. Accurate action quality assessment in this discipline therefore necessitates explicit modeling of human–object interactions. However, most existing methods rely on video-level or human-centric representations, overlooking structured athlete–ski relationships and limiting the evaluation of control and stability. To address this, we construct a freestyle skiing big air dataset with fine-grained annotations, including frame-level athlete–ski bounding boxes and performance-related metadata. Based on this dataset, we propose an interaction-aware framework that captures athlete–ski coordination by combining instance-level appearance and positional features through spatiotemporal reasoning. Furthermore, because the commonly used uniform sampling dilutes performance-critical moments in long sequences, we introduce a training-free entropy-based sampling strategy that exploits athlete–ski geometric dynamics to identify moments such as take-off, rotation, and landing, thereby reducing redundancy. Together, these designs address where to look and when to focus in big air assessment. Extensive experiments demonstrate that our method achieves a Spearman’s rank correlation of 0.7173 on the proposed dataset, outperforming state-of-the-art methods.
{"title":"Interaction-aware representation learning for action quality assessment in freestyle skiing big air","authors":"Shiyue Chen , Yanchao Liu , Ziyue Wang , Xina Cheng , Takeshi Ikenaga","doi":"10.1016/j.cviu.2026.104634","DOIUrl":"10.1016/j.cviu.2026.104634","url":null,"abstract":"<div><div>Freestyle skiing big air requires precise athlete–ski coordination to determine both technical difficulty and execution quality. Accurate action quality assessment in this discipline therefore necessitates explicit modeling of human–object interactions. However, most existing methods rely on video-level or human-centric representations, overlooking structured athlete-ski relationships and limiting evaluation of control and stability. To address this, we construct a freestyle skiing big air dataset with fine-grained annotations, including frame-level athlete-ski bounding boxes and performance-related metadata. Based on this dataset, we propose an interaction-aware framework that captures athlete–ski coordination by combining instance-level appearance and positional features through spatiotemporal reasoning. Furthermore, to avoid commonly used uniform sampling diluting performance-critical moments in long sequences, we introduce a training-free entropy-based sampling strategy that exploits athlete–ski geometric dynamics to identify performance-critical moments such as take-off, rotation, and landing, thereby reducing redundancy. Together, these designs address <em>where to look</em> and <em>when to focus</em> in big air assessment. Extensive experiments demonstrate that our method achieves a Spearman’s rank correlation of 0.7173 on the proposed dataset, outperforming state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104634"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146078741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EventSleep2: Sleep activity recognition on complete night sleep recordings with an event camera
Pub Date: 2026-02-01 | Epub Date: 2026-01-02 | DOI: 10.1016/j.cviu.2025.104619
Nerea Gallego, Carlos Plou, Miguel Marcos, Pablo Urcola, Luis Montesano, Eduardo Montijano, Ruben Martinez-Cantin, Ana C. Murillo
Sleep is fundamental to health, and society is increasingly aware of the impact and relevance of sleep disorders. Traditional diagnostic methods, like polysomnography, are intrusive and resource-intensive. Instead, research is focusing on developing novel, less intrusive or portable methods that combine intelligent sensors with activity recognition for diagnosis support and scoring. Event cameras offer a promising alternative for automated, in-home sleep activity recognition due to their excellent low-light performance and low power consumption. This work introduces EventSleep2-data, a significant extension to the EventSleep dataset, featuring 10 complete night recordings (around 7 h each) of volunteers sleeping in their homes. Unlike the original short and controlled recordings, this new dataset captures natural, full-night sleep sessions under realistic conditions. The new data incorporates challenging real-world scene variations, an efficient movement-triggered sparse data recording pipeline, and synchronized 2-channel EEG data for a subset of recordings. We also present EventSleep2-net, a novel event-based sleep activity recognition approach with a dual-head architecture to simultaneously analyze motion classes and static poses. The model is specifically designed to handle the motion-triggered, sparse nature of complete night recordings. Unlike the original EventSleep architecture, EventSleep2-net can predict both movement and static poses even during long periods with no events. We demonstrate state-of-the-art performance on both EventSleep1-data, the original dataset, and EventSleep2-data, with comprehensive ablation studies validating our design decisions. Together, EventSleep2-data and EventSleep2-net overcome the limitations of the previous setup and enable continuous, full-night analysis for real-world sleep monitoring, significantly advancing the potential of event-based vision for sleep disorder studies. Code and data are publicly available at: https://sites.google.com/unizar.es/eventsleep.
{"title":"EventSleep2: Sleep activity recognition on complete night sleep recordings with an event camera","authors":"Nerea Gallego , Carlos Plou , Miguel Marcos , Pablo Urcola , Luis Montesano , Eduardo Montijano , Ruben Martinez-Cantin , Ana C. Murillo","doi":"10.1016/j.cviu.2025.104619","DOIUrl":"10.1016/j.cviu.2025.104619","url":null,"abstract":"<div><div>Sleep is fundamental to health, and society is more and more aware of the impact and relevance of sleep disorders. Traditional diagnostic methods, like polysomnography, are intrusive and resource-intensive. Instead, research is focusing on developing novel, less intrusive or portable methods that combine intelligent sensors with activity recognition for diagnosis support and scoring. Event cameras offer a promising alternative for automated, in-home sleep activity recognition due to their excellent low-light performance and low power consumption. This work introduces <strong>EventSleep2-data</strong>, a significant extension to the EventSleep dataset, featuring 10 complete night recordings (around 7 h each) of volunteers sleeping in their homes. Unlike the original short and controlled recordings, this new dataset captures natural, full-night sleep sessions under realistic conditions. This new data incorporates challenging real-world scene variations, an efficient movement-triggered sparse data recording pipeline, and synchronized 2-channel EEG data for a subset of recordings. We also present <strong>EventSleep2-net</strong>, a novel event-based sleep activity recognition approach with a dual-head architecture to simultaneously analyze motion classes and static poses. The model is specifically designed to handle the motion-triggered, sparse nature of complete night recordings. Unlike the original EventSleep architecture, EventSleep2-net can predict both movement and static poses even during long periods with no events. We demonstrate state-of-the-art performance on both EventSleep1-data, the original dataset, and EventSleep2-data, with comprehensive ablation studies validating our design decisions. Together, EventSleep2-data and EventSleep2-net overcome the limitations of the previous setup and enable continuous, full-night analysis for real-world sleep monitoring, significantly advancing the potential of event-based vision for sleep disorder studies. Code and data are publicly available on the webpage: <span><span>https://sites.google.com/unizar.es/eventsleep</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104619"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A dual-channel model based on multi-feature fusion for face liveness detection
Pub Date: 2026-02-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.cviu.2026.104635
Bowen Xu, Yaru Sui, Longxin Liu, Zhenlong Ma, Yunlong Shi, Wentong Li, Xiaoqiang Ji
Face liveness detection algorithms are widely used in anti-spoofing applications, where they guarantee the accuracy and security of face recognition systems. However, with the continuous development of technologies such as 3D printing and artificial intelligence, traditional face liveness detection algorithms struggle to withstand spoofing attacks effectively. In this paper, we propose a multi-feature fusion algorithm that uses only facial video for face liveness detection. Initially, we design a dual-channel network named DC-Net. It extracts robust remote photoplethysmography signals directly from 5-second facial videos, as well as fine global texture features from the keyframes of the image sequence. Subsequently, a fusion module based on the attention mechanism carries out feature-level fusion. Ultimately, we use a fully connected layer for binary classification. Our methodology was validated on the REPLAY-ATTACK and 3DMAD datasets, where it attained accuracies of 99.79% and 100%, respectively, against printing, screen replay, and 3D mask attacks. Meanwhile, cross-dataset testing was conducted on the CASIA-FASD and HKBU-MARs V1+ datasets, achieving HTERs of 25.56% and 0.00%, respectively. This indicates that the algorithm is accurate and robust against spoofing attacks in many different scenarios, providing important ideas and technical support for the design and implementation of reliable face recognition systems.
{"title":"A dual-channel model based on multi-feature fusion for face liveness detection","authors":"Bowen Xu , Yaru Sui , Longxin Liu, Zhenlong Ma, Yunlong Shi, Wentong Li, Xiaoqiang Ji","doi":"10.1016/j.cviu.2026.104635","DOIUrl":"10.1016/j.cviu.2026.104635","url":null,"abstract":"<div><div>Face liveness detection algorithms are widely used in anti-spoofing applications, which guarantee the accuracy and security of face recognition systems. However, with the continuous development of technologies such as 3D printing and artificial intelligence, traditional face-liveness detection algorithms struggle to withstand spoofing attacks effectively. In this paper, we propose a multi-feature fusion algorithm using only facial video for face liveness detection. Initially, we design a dual-channel network named DC-Net. It can extract robust remote photoplethysmography signals directly from 5-second facial videos, as well as fine global texture features from the keyframes of the image sequence. Subsequently, a fusion module based on the attention mechanism is used to carry out feature-level fusion. Ultimately, we use the fully connected layer for binary classification. Our methodology was validated using the REPLAY-ATTACK dataset and the 3DMAD dataset, demonstrating that for printing attacks, screen replay attacks, and 3D mask attacks, our approach attained an accuracy of 99.79% and 100% on both datasets, respectively. Meanwhile, cross-dataset testing was conducted on the CASIA-FASD and HKBU-MARs V1+ datasets, achieving HTER of 25.56% and 0.00%, respectively. This indicates that the algorithm has good accuracy and robustness in dealing with spoofing attacks in many different scenarios, which provides important ideas and technical support for the design and implementation of reliable face recognition systems.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104635"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A YOLO-OC real-time small object detection in ocean scenes
Pub Date: 2026-02-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.cviu.2025.104625
Yang Zhang, Tao Qin, Yimin Zhou
Ocean scenes are usually intricate and complex, with low signal-to-noise ratios for tiny or distant objects and susceptibility to interference from the underwater background and lighting conditions, which makes general object detection methods difficult to apply directly in ocean scenes. To solve these problems, a YOLO-OCEAN method is proposed with YOLOv5 as the baseline model. An ultra-small-scale feature layer, multi-branch feature enhancement with cross-scale fusion, a visual-transformer bridge, a CSP-connected SPPF block and dynamic activation are incorporated into the backbone and neck to improve the detection performance. A more efficient Intersection-over-Union regression loss function is applied to the detection head structure. Moreover, the model is re-parameterized and made lightweight to enhance its detection speed. Comparison experiments with other object detection baseline models show that the proposed YOLO-OC method achieves 86.6% mAP@0.5 and a 5.1 ms inference time, demonstrating accurate real-time detection of small objects in ocean scenes.
{"title":"A YOLO-OC real-time small object detection in ocean scenes","authors":"Yang Zhang , Tao Qin , Yimin Zhou","doi":"10.1016/j.cviu.2025.104625","DOIUrl":"10.1016/j.cviu.2025.104625","url":null,"abstract":"<div><div>The ocean scenes are usually intricate and complex, with low signal-to-noise ratios for tiny or distant objects and susceptible to interference from underwater background and lighting conditions, which makes the general object detection methods more difficult to be directly applied in the ocean scenes. To solve the above problems, a YOLO-OCEAN method is proposed the YOLOv5 as the baseline model. An ultra-small-scale feature layer, multi-branch feature enhancement with cross-scale fusion, a visual-transformer bridge, a CSP-connected SPPF block and dynamic activation are incorporated into the backbone and neck to improve the detection performance. More Efficient Intersection over Union regression loss function is applied to the detection head structure. Moreover, the model is re-parameterized and lightweighted to enhance the detection speed of the model. Comparison experiments have been performed with other object detection baseline models where 86.6% [email protected] and 5.1 ms inference time are achieved with the proposed YOLO-OC method, proving the real-time detection capability for small objects in ocean scenes with high accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104625"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Indoor UAV navigation using event cameras and intermediate frame reconstruction
Pub Date: 2026-02-01 | Epub Date: 2026-01-10 | DOI: 10.1016/j.cviu.2026.104650
David Tejero-Ruiz, David Solís-Martín, Francisco J. Pérez-Grau, Joaquín Borrego-Díaz
Indoor UAV navigation faces significant challenges due to GPS signal absence and limitations of conventional visual-inertial systems under challenging lighting and motion conditions. This paper presents an event-based visual-inertial odometry system that addresses these limitations through intermediate frame reconstruction from event streams combined with established odometry algorithms. The approach leverages event cameras’ unique characteristics — microsecond temporal resolution, high dynamic range (120 dB), and motion blur immunity — to maintain stable navigation performance under conditions that cause conventional systems to fail. The system achieves real-time operation at 30 Hz frame reconstruction and 20 Hz pose estimation on embedded hardware, consuming 15 W power while adding only 50 g to the UAV platform. Experimental validation in controlled indoor environments demonstrates mean absolute pose errors of 26–42 cm across different operational conditions, comparable to conventional visual-inertial systems. Critically, the system maintains stable performance during rapid lighting transitions, showing only 59% performance degradation compared to baseline conditions, while conventional cameras typically experience complete tracking failure. The results establish event-based visual-inertial odometry as a viable alternative for indoor UAV navigation, particularly in applications requiring environmental robustness over marginal accuracy improvements under optimal conditions.
{"title":"Indoor UAV navigation using event cameras and intermediate frame reconstruction","authors":"David Tejero-Ruiz , David Solís-Martín , Francisco J. Pérez-Grau , Joaquín Borrego-Díaz","doi":"10.1016/j.cviu.2026.104650","DOIUrl":"10.1016/j.cviu.2026.104650","url":null,"abstract":"<div><div>Indoor UAV navigation faces significant challenges due to GPS signal absence and limitations of conventional visual-inertial systems under challenging lighting and motion conditions. This paper presents an event-based visual-inertial odometry system that addresses these limitations through intermediate frame reconstruction from event streams combined with established odometry algorithms. The approach leverages event cameras’ unique characteristics — microsecond temporal resolution, high dynamic range (120 dB), and motion blur immunity — to maintain stable navigation performance under conditions that cause conventional systems to fail. The system achieves real-time operation at 30 Hz frame reconstruction and 20 Hz pose estimation on embedded hardware, consuming 15 W power while adding only 50 g to the UAV platform. Experimental validation in controlled indoor environments demonstrates mean absolute pose errors of 26–42 cm across different operational conditions, comparable to conventional visual-inertial systems. Critically, the system maintains stable performance during rapid lighting transitions, showing only 59% performance degradation compared to baseline conditions, while conventional cameras typically experience complete tracking failure. The results establish event-based visual-inertial odometry as a viable alternative for indoor UAV navigation, particularly in applications requiring environmental robustness over marginal accuracy improvements under optimal conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104650"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending Large Language Models to multimodality for non-English languages
Pub Date: 2026-02-01 | Epub Date: 2025-12-30 | DOI: 10.1016/j.cviu.2025.104618
Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
The growing popularity of Large Vision-Language Models has highlighted and intensified one of the most well-known challenges in the field of Large Language Models: training is mainly, and most of the time exclusively, conducted on English data. Consequently, the resulting models are more prone to error in non-English tasks, and this issue is exacerbated in multimodal settings that are even more complex and use task-specific datasets. Given this, research on Large Language Models has turned toward adapting them to non-English languages. However, the scarcity of open and curated resources for these languages poses a significant limitation. In this work, we aim to tackle this challenge by exploring the adaptation of Large Vision-Language Models to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how the evaluation of the results is influenced when training a vision-to-text adapter across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research.
{"title":"Extending Large Language Models to multimodality for non-English languages","authors":"Elio Musacchio , Lucia Siciliani , Pierpaolo Basile , Giovanni Semeraro","doi":"10.1016/j.cviu.2025.104618","DOIUrl":"10.1016/j.cviu.2025.104618","url":null,"abstract":"<div><div>The growing popularity of Large Vision-Language Models has highlighted and intensified one of the most well-known challenges in the field of Large Language Models: training is mainly, and most of the time exclusively, conducted on English data. Consequently, the resulting models are more prone to error in non-English tasks, and this issue is exacerbated in multimodal settings that are even more complex and use task-specific datasets. Given this, research on Large Language Models has turned toward adapting them to non-English languages. However, the scarcity of open and curated resources for these languages poses a significant limitation. In this work, we aim to tackle the aforementioned challenge by exploring Large Vision-Language Models adaptation to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how the evaluation of the results is influenced when training a vision-to-text adapter across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104618"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}