A YOLO-OC real-time small object detection in ocean scenes
Pub Date: 2026-01-07, DOI: 10.1016/j.cviu.2025.104625
Yang Zhang, Tao Qin, Yimin Zhou
Ocean scenes are usually intricate and complex, with low signal-to-noise ratios for tiny or distant objects, and are susceptible to interference from the underwater background and lighting conditions, which makes general object detection methods difficult to apply directly in ocean scenes. To solve these problems, a YOLO-OC method is proposed with YOLOv5 as the baseline model. An ultra-small-scale feature layer, multi-branch feature enhancement with cross-scale fusion, a visual-transformer bridge, a CSP-connected SPPF block and dynamic activation are incorporated into the backbone and neck to improve detection performance. A More Efficient Intersection over Union regression loss function is applied in the detection head. Moreover, the model is re-parameterized and made lightweight to increase detection speed. Comparison experiments with other object detection baseline models show that the proposed YOLO-OC method achieves 86.6% mAP@0.5 and a 5.1 ms inference time, demonstrating real-time, high-accuracy detection of small objects in ocean scenes.
{"title":"A YOLO-OC real-time small object detection in ocean scenes","authors":"Yang Zhang , Tao Qin , Yimin Zhou","doi":"10.1016/j.cviu.2025.104625","DOIUrl":"10.1016/j.cviu.2025.104625","url":null,"abstract":"<div><div>The ocean scenes are usually intricate and complex, with low signal-to-noise ratios for tiny or distant objects and susceptible to interference from underwater background and lighting conditions, which makes the general object detection methods more difficult to be directly applied in the ocean scenes. To solve the above problems, a YOLO-OCEAN method is proposed the YOLOv5 as the baseline model. An ultra-small-scale feature layer, multi-branch feature enhancement with cross-scale fusion, a visual-transformer bridge, a CSP-connected SPPF block and dynamic activation are incorporated into the backbone and neck to improve the detection performance. More Efficient Intersection over Union regression loss function is applied to the detection head structure. Moreover, the model is re-parameterized and lightweighted to enhance the detection speed of the model. Comparison experiments have been performed with other object detection baseline models where 86.6% [email protected] and 5.1 ms inference time are achieved with the proposed YOLO-OC method, proving the real-time detection capability for small objects in ocean scenes with high accuracy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104625"},"PeriodicalIF":3.5,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments
Pub Date: 2026-01-06, DOI: 10.1016/j.cviu.2025.104629
Ofer Idan, Yoli Shavit, Yosi Keller
Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation with respect to a reference, pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our code and pre-trained models are available at https://github.com/yolish/relformer.
{"title":"Beyond familiar landscapes: Exploring the limits of relative pose regressors in new environments","authors":"Ofer Idan, Yoli Shavit, Yosi Keller","doi":"10.1016/j.cviu.2025.104629","DOIUrl":"10.1016/j.cviu.2025.104629","url":null,"abstract":"<div><div>Relative pose regressors (RPRs) determine the pose of a query image by estimating its relative translation and rotation to a reference pose-labeled camera. Unlike other regression-based localization techniques confined to a scene’s absolute parameters, RPRs learn residuals, making them adaptable to new environments. However, RPRs have exhibited limited generalization to scenes not utilized during training (“unseen scenes”). In this work, we explore the ability of RPRs to localize in unseen scenes and propose algorithmic modifications to enhance their generalization. These modifications include attention-based aggregation of coarse feature maps, dynamic adaptation of model weights, and geometry-aware optimization. Our proposed approach improves the localization accuracy of RPRs in unseen scenes by a notable margin across multiple indoor and outdoor benchmarks and under various conditions while maintaining comparable performance in scenes used during training. We assess the contribution of each component through ablation studies and further analyze the uncertainty of our model in unseen scenes. Our Code and pre-trained models are available at <span><span>https://github.com/yolish/relformer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104629"},"PeriodicalIF":3.5,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Body shape diversity in the training data and consequences on motion generation
Pub Date: 2026-01-06, DOI: 10.1016/j.cviu.2025.104632
Hugo Rodet, Lama Séoud
Describing human movement is key to many applications, ranging from medicine to 3D animation. Morphology is an important factor influencing how people move, but it is as yet seldom accounted for in human-centric tasks like motion generation. In this study, we first assess the diversity of body shapes in real human motion datasets, then demonstrate the benefits of morphology-aware motion generation. We reveal biases in the data regarding body shape, in particular for body fat and gender representation. Considering the incompleteness of even the largest motion-capture datasets, proving quantitatively that morphology influences motion is difficult using existing tools: we thus propose a new metric relying on 3D body mesh self-collision, and use it to demonstrate that individuals with varied body mass indices also differ in their movements. One consequence is that generic, morphology-agnostic generated poses tend to be unsuitable for the body models they are used with, and we show that this tends to increase self-collision artifacts. Building upon these results, we show that morphology-aware motion generation reduces mesh self-collision artifacts despite not being trained for it explicitly, even when using a common backbone and a naive conditioning strategy. Morphology-aware generation can also be seamlessly integrated into most pose and motion generation architectures with little-to-no extra computational cost and without compromising generation diversity or realism.
{"title":"Body shape diversity in the training data and consequences on motion generation","authors":"Hugo Rodet, Lama Séoud","doi":"10.1016/j.cviu.2025.104632","DOIUrl":"10.1016/j.cviu.2025.104632","url":null,"abstract":"<div><div>Describing human movement is key to many applications, ranging from medicine to 3D animation. Morphology is an important factor influencing how people move, but as of yet it is seldom accounted for in human-centric tasks like motion generation. In this study, we first assess the diversity of body shapes in real human motion datasets, then demonstrate the benefits of morphology-aware motion generation. We reveal biases in the data regarding body shape, in particular for body fat and gender representation. Considering the incompleteness of even the largest motion-capture datasets, proving quantitatively that morphology influences motion is difficult using existing tools: we thus propose a new metric relying on 3D body mesh self-collision, and use it to demonstrate that individuals with varied body mass indices also differ in their movements. One consequence is that generic, morphology-agnostic generated poses tend to be unsuitable for the body models they are used with, and we show that it tends to increase self-collision artifacts. Building upon these results, we show that morphology-aware motion generation reduces mesh self-collision artifacts despite not being trained for it explicitly, even when using a common backbone and a naive conditioning strategy. Morphology-aware generation can also be seamlessly integrated to most pose and motion generation architectures with little-to-no extra computational cost and without compromising generation diversity of realism.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104632"},"PeriodicalIF":3.5,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Kernel Information-interaction Network for single image super-resolution
Pub Date: 2026-01-06, DOI: 10.1016/j.cviu.2025.104633
Yi Wu, Weiwei Wang, Yujie Wang, Kaige Cui
Single-image super-resolution (SISR) has achieved substantial progress, enabling high-fidelity restoration from low-resolution inputs. However, the high computational cost of existing methods remains a major challenge, limiting their deployment on resource-constrained edge devices. To address this issue, we propose a lightweight Large Kernel Information Interaction Network (LKIN), which effectively balances computational efficiency and reconstruction quality. Our approach integrates multi-scale large receptive fields, information distillation, and attention mechanisms to enhance feature representation and improve super-resolution performance. Specifically, we replace the conventional BSConv with a large kernel network, allowing the model to capture long-range dependencies more effectively while reducing the reliance on deeper architectures. Additionally, we introduce a Multi-Scale Feature Enhancement (MSFE) module, which leverages efficient convolutions and attention mechanisms to refine extracted features while eliminating redundant operations. Extensive experiments are conducted on standard benchmarks (Set5, Set14, BSD100, Urban100, Manga109) at ×2, ×3, and ×4 upscaling factors. We evaluate performance using PSNR and SSIM. Compared with representative lightweight CNN-based methods (e.g., IMDN, BSRN, CARN) and Transformer-based approaches (e.g., SwinIR-light, SRFormer, ESRT, NGSwin), LKIN achieves up to +0.15 dB PSNR improvements over the strongest baseline while reducing parameters by 18%.
{"title":"Large Kernel Information-interaction Network for single image super-resolution","authors":"Yi Wu, Weiwei Wang, Yujie Wang, Kaige Cui","doi":"10.1016/j.cviu.2025.104633","DOIUrl":"10.1016/j.cviu.2025.104633","url":null,"abstract":"<div><div>Single-image super-resolution (SISR) has achieved substantial progress, enabling high-fidelity restoration from low-resolution inputs. However, the high computational cost of existing methods remains a major challenge, limiting their deployment on resource-constrained edge devices. To address this issue, we propose a lightweight Large Kernel Information Interaction Network (LKIN), which effectively balances computational efficiency and reconstruction quality. Our approach integrates multi-scale large receptive fields, information distillation, and attention mechanisms to enhance feature representation and improve super-resolution performance. Specifically, we replace the conventional BSConv with a large kernel network, allowing the model to capture long-range dependencies more effectively while reducing the reliance on deeper architectures. Additionally, we introduce a Multi-Scale Feature Enhancement (MSFE) module, which leverages efficient convolutions and attention mechanisms to refine extracted features while eliminating redundant operations. Extensive experiments are conducted on standard benchmarks (Set5, Set14, BSD100, Urban100, Manga109) at <span><math><mo>×</mo></math></span> 2, <span><math><mo>×</mo></math></span> 3, and <span><math><mo>×</mo></math></span> 4 upscaling factors. We evaluate performance using PSNR and SSIM. Compared with representative lightweight CNN-based methods (e.g., IMDN, BSRN, CARN) and Transformer-based approaches (e.g., SwinIR-light, SRFormer, ESRT, NGSwin), LKIN achieves up to +0.15 dB PSNR improvements over the strongest baseline while reducing parameters by 18%.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104633"},"PeriodicalIF":3.5,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causality-inspired multi-grained cross-modal sign language retrieval
Pub Date: 2026-01-02, DOI: 10.1016/j.cviu.2025.104631
Xu-Hua Yang, Dong Wei, Wangjie Li, Hongxiang Hu
Sign language retrieval aims to enhance communication between deaf and hearing individuals. Due to the scarcity of sign language video data, researchers often use contrastive learning-based data augmentation methods to mitigate data sparsity. However, the pairwise metric learning paradigm fails to properly account for differences among various augmentations and may even erroneously learn the distinctions between augmentation methods. Moreover, existing sign language retrieval studies are susceptible to spurious correlations between cross-modal data and often overlook associations across different granularities. To address these limitations, we propose a Causality-Inspired Multi-Grained Cross-Modal Sign Language Retrieval method (CMCM) that enhances cross-modal retrieval capabilities by eliminating both observable and unobservable confounders. First, CMCM performs varying degrees of augmentation on the original videos and employs backdoor adjustment to mitigate confounders among augmented data, obtaining highly stable video representations invariant to confounding factors. Next, we propose a cross-modal causal-attention Gaussian network that employs front-door causal intervention to eliminate implicit confounders and parameterize their Gaussian distribution for fine-grained alignment. Finally, we design a temporal-motion covariance pooling method to capture global features of sign language sequences, facilitating coarse-grained cross-modal feature alignment. Extensive experiments on three public datasets demonstrate that CMCM achieves highly competitive retrieval accuracy. The code is available at: https://github.com/vddong-zjut/CMCM.
{"title":"Causality-inspired multi-grained cross-modal sign language retrieval","authors":"Xu-Hua Yang, Dong Wei, Wangjie Li, Hongxiang Hu","doi":"10.1016/j.cviu.2025.104631","DOIUrl":"10.1016/j.cviu.2025.104631","url":null,"abstract":"<div><div>Sign language retrieval aims to enhance communication between the deaf and hearing individuals. Due to the scarcity of sign language video data, researchers often use contrastive learning-based data augmentation methods to mitigate data sparsity. However, the pairwise metric learning paradigm fails to properly account for differences among various augmentations and may even erroneously learn the distinctions between augmentation methods. Moreover, existing sign language retrieval studies are susceptible to spurious correlations between cross-modal data and often overlook associations across different granularities. To address these limitations, we propose a Causality-Inspired Multi-Grained Cross-Modal Sign Language Retrieval method (CMCM) that enhances cross-modal retrieval capabilities by eliminating both observable and unobservable confounders. First, CMCM performs varying degrees of augmentation on the original videos and employs backdoor adjustment to mitigate confounders among augmented data, obtaining highly stable video representations invariant to confounding factors. Next, we propose a cross-modal causal-attention Gaussian network that employs front-door causal intervention to eliminate implicit confounders and parameterize their Gaussian distribution for fine-grained alignment. Finally, we design a temporal-motion covariance pooling method to capture global features of sign language sequences, facilitating coarse-grained cross-modal feature alignment. Extensive experiments on three public datasets demonstrate that CMCM achieves highly competitive retrieval accuracy. The code is available at: <span><span>https://github.com/vddong-zjut/CMCM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104631"},"PeriodicalIF":3.5,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EventSleep2: Sleep activity recognition on complete night sleep recordings with an event camera
Pub Date: 2026-01-02, DOI: 10.1016/j.cviu.2025.104619
Nerea Gallego , Carlos Plou , Miguel Marcos , Pablo Urcola , Luis Montesano , Eduardo Montijano , Ruben Martinez-Cantin , Ana C. Murillo
Sleep is fundamental to health, and society is increasingly aware of the impact and relevance of sleep disorders. Traditional diagnostic methods, like polysomnography, are intrusive and resource-intensive. Instead, research is focusing on developing novel, less intrusive or portable methods that combine intelligent sensors with activity recognition for diagnosis support and scoring. Event cameras offer a promising alternative for automated, in-home sleep activity recognition due to their excellent low-light performance and low power consumption. This work introduces EventSleep2-data, a significant extension to the EventSleep dataset, featuring 10 complete night recordings (around 7 h each) of volunteers sleeping in their homes. Unlike the original short and controlled recordings, this new dataset captures natural, full-night sleep sessions under realistic conditions. This new data incorporates challenging real-world scene variations, an efficient movement-triggered sparse data recording pipeline, and synchronized 2-channel EEG data for a subset of recordings. We also present EventSleep2-net, a novel event-based sleep activity recognition approach with a dual-head architecture to simultaneously analyze motion classes and static poses. The model is specifically designed to handle the motion-triggered, sparse nature of complete night recordings. Unlike the original EventSleep architecture, EventSleep2-net can predict both movement and static poses even during long periods with no events. We demonstrate state-of-the-art performance on both EventSleep1-data, the original dataset, and EventSleep2-data, with comprehensive ablation studies validating our design decisions. Together, EventSleep2-data and EventSleep2-net overcome the limitations of the previous setup and enable continuous, full-night analysis for real-world sleep monitoring, significantly advancing the potential of event-based vision for sleep disorder studies. Code and data are publicly available on the webpage: https://sites.google.com/unizar.es/eventsleep.
{"title":"EventSleep2: Sleep activity recognition on complete night sleep recordings with an event camera","authors":"Nerea Gallego , Carlos Plou , Miguel Marcos , Pablo Urcola , Luis Montesano , Eduardo Montijano , Ruben Martinez-Cantin , Ana C. Murillo","doi":"10.1016/j.cviu.2025.104619","DOIUrl":"10.1016/j.cviu.2025.104619","url":null,"abstract":"<div><div>Sleep is fundamental to health, and society is more and more aware of the impact and relevance of sleep disorders. Traditional diagnostic methods, like polysomnography, are intrusive and resource-intensive. Instead, research is focusing on developing novel, less intrusive or portable methods that combine intelligent sensors with activity recognition for diagnosis support and scoring. Event cameras offer a promising alternative for automated, in-home sleep activity recognition due to their excellent low-light performance and low power consumption. This work introduces <strong>EventSleep2-data</strong>, a significant extension to the EventSleep dataset, featuring 10 complete night recordings (around 7 h each) of volunteers sleeping in their homes. Unlike the original short and controlled recordings, this new dataset captures natural, full-night sleep sessions under realistic conditions. This new data incorporates challenging real-world scene variations, an efficient movement-triggered sparse data recording pipeline, and synchronized 2-channel EEG data for a subset of recordings. We also present <strong>EventSleep2-net</strong>, a novel event-based sleep activity recognition approach with a dual-head architecture to simultaneously analyze motion classes and static poses. The model is specifically designed to handle the motion-triggered, sparse nature of complete night recordings. Unlike the original EventSleep architecture, EventSleep2-net can predict both movement and static poses even during long periods with no events. We demonstrate state-of-the-art performance on both EventSleep1-data, the original dataset, and EventSleep2-data, with comprehensive ablation studies validating our design decisions. Together, EventSleep2-data and EventSleep2-net overcome the limitations of the previous setup and enable continuous, full-night analysis for real-world sleep monitoring, significantly advancing the potential of event-based vision for sleep disorder studies. Code and data are publicly available on the webpage: <span><span>https://sites.google.com/unizar.es/eventsleep</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104619"},"PeriodicalIF":3.5,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Place recognition for visual assistive localization under challenging visual appearance variations
Pub Date: 2026-01-01, DOI: 10.1016/j.cviu.2025.104623
Ruiqi Cheng, Hai-Miao Hu, Chongze Wang, Xuan Gong
Due to the complexity of real-world environments, self-localization remains a critical yet unresolved challenge for individuals with visual impairments during travel. Visual appearance variations in the context of assistive technology, such as season changes, illumination changes, viewpoint changes, and dynamic occlusions, significantly hinder the performance of place recognition. This paper proposes a novel assistive visual localization method to address these challenges. In order to extract landmark-related features from images with appearance variations, dual constraints of place classification and feature distillation are proposed based on large-scale place recognition and human matting datasets. Additionally, online sequential matching is employed for place recognition, leveraging the temporal consistency embedded in multi-frame sequences to further eliminate erroneous localization results. Evaluated on the large-scale SF-XL dataset augmented with human matting, the proposed image feature model achieves a 3% improvement in Recall@1 compared to state-of-the-art approaches using similar backbone architectures, indicating better image retrieval performance under assistive occlusion scenarios. More importantly, in real-world validation using self-collected assistive datasets, the proposed visual localization pipeline incorporating sequential matching achieves F1 scores over 0.85 and shows advantages over existing sequential place recognition methods. The implementation code of the proposed algorithm, along with a real-world testing dataset for assistive localization, is released at https://github.com/chengricky/AssistivePlace.
{"title":"Place recognition for visual assistive localization under challenging visual appearance variations","authors":"Ruiqi Cheng , Hai-Miao Hu , Chongze Wang , Xuan Gong","doi":"10.1016/j.cviu.2025.104623","DOIUrl":"10.1016/j.cviu.2025.104623","url":null,"abstract":"<div><div>Due to the complexity of real-world environments, self-localization remains critical yet unresolved challenges for individuals with visual impairments during travel. Visual appearance variations in the context of assistive technology, such as season changes, illumination changes, viewpoint changes, and dynamic occlusions, significantly hinder the performance of place recognition. This paper proposes a novel assistive visual localization method to address these challenges. In order to extract landmark-related features from images with appearance variations, the dual constraints of place classification and feature distillation are proposed based on large-scale place recognition and human matting datasets. Additionally, online sequential matching is employed for place recognition, leveraging temporal consistency embedded in multi-frame sequences to further eliminate erroneous localization results. Evaluated on the large-scale SF-XL dataset augmented with human matting, the proposed image feature model achieves a 3% improvement in Recall@1 compared to state-of-the-art approaches using similar backbone architectures, which indicates the better performance of image retrieval under the assistive occlusion scenarios. More importantly, in real-world validation using self-collected assistive datasets, the proposed visual localization pipeline incorporating sequential matching achieves <span><math><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span> scores over 0.85 and shows advantages over existing sequential place recognition methods. The implementation codes of the proposed algorithm, along with a real-world testing dataset for assistive localization, are released at <span><span>https://github.com/chengricky/AssistivePlace</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104623"},"PeriodicalIF":3.5,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145883901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CNNs vs Transformers: Confirmatory factor analysis for eye gaze classification with explainable AI
Pub Date: 2025-12-31, DOI: 10.1016/j.cviu.2025.104624
Naman Goyal, Major Singh Goraya, Tajinder Singh
Recent advancements in eye gaze classification have significant implications for enhancing human-robot interaction. Existing benchmark datasets such as UnityEyes often exhibit class imbalance issues, negatively impacting classification efficacy. Addressing this challenge, a balanced dataset, termed reformed-UE, containing 500 images per class across eight distinct gaze directions is introduced. A novel hyperparameter-optimized deep learning model, designated G_image, is proposed for image-based gaze direction classification. Additionally, the balanced, large-scale MRL dataset enables rigorous generalization testing of the G_image model. Comparative evaluations involving state-of-the-art models including MobileNetV2, InceptionNetV3, AttentionCNN, MobileViT, Hybrid PCCR and Swin Transformers are conducted. The G_image model achieves superior performance metrics, registering a validation accuracy of 93.75%, exceeding competing models by approximately 4 to 5 percentage points. Furthermore, G_image attains higher precision (0.93), recall (0.93), and F1-score (0.93), significantly reducing classification errors, particularly in the challenging TopRight class. Interpretability analyses employing Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps provide further confirmation of the model’s proficiency in identifying essential latent features critical for accurate classification.
{"title":"CNNs vs Transformers: Confirmatory factor analysis for eye gaze classification with explainable AI","authors":"Naman Goyal, Major Singh Goraya, Tajinder Singh","doi":"10.1016/j.cviu.2025.104624","DOIUrl":"10.1016/j.cviu.2025.104624","url":null,"abstract":"<div><div>Recent advancements in eye gaze classification have significant implications for enhancing human robot interaction. Existing benchmark datasets such as UnityEyes often exhibit class imbalance issues, negatively impacting classification efficacy. Addressing this challenge, a balanced dataset, termed reformed-UE, containing 500 images per class across eight distinct gaze directions is introduced. A novel hyperparameter optimized deep learning model, designated <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span>, is proposed for image-based gaze direction classification. Additionally, the balanced, large-scale MRL dataset enables rigorous generalization testing of the <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span> model. Comparative evaluations involving state of the art models including MobileNetV2, InceptionNetV3, AttentionCNN, MobileViT, Hybrid PCCR and Swin Transformers are conducted. The <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span> model achieves superior performance metrics, registering a validation accuracy of 93.75%, exceeding competing models by approximately 4 to 5 percentage points. Furthermore, <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span> attains higher precision (0.93), recall (0.93), and F1score (0.93), significantly reducing classification errors, particularly in the challenging TopRight class. Interpretability analyses employing Gradient weighted Class Activation Mapping (GradCAM) heatmaps provide further confirmation of the model’s proficiency in identifying essential latent features critical for accurate classification.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104624"},"PeriodicalIF":3.5,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors
Pub Date: 2025-12-31, DOI: 10.1016/j.cviu.2025.104622
Carlos Plou, Lorenzo Mur-Labadia, Jose J. Guerrero, Ruben Martinez-Cantin, Ana C. Murillo
Video is a crucial perception component in both robotics and wearable devices, two key technologies for enabling innovative assistive applications such as navigation and procedure execution assistance tools. Video understanding tasks are essential to enable these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities based on natural language descriptions in long, untrimmed videos. This paper introduces Bayesian-VSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset and winning the Goal Step challenge at the EgoVis workshop at CVPR 2024. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications. Code is released at https://github.com/cplou99/BayesianVSLNet.
{"title":"Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors","authors":"Carlos Plou , Lorenzo Mur-Labadia , Jose J. Guerrero, Ruben Martinez-Cantin, Ana C. Murillo","doi":"10.1016/j.cviu.2025.104622","DOIUrl":"10.1016/j.cviu.2025.104622","url":null,"abstract":"<div><div>Video is a crucial perception component in both robotics and wearable devices, two key technologies to enable innovative assistive applications, such as navigation and procedure execution assistance tools. Video understanding tasks are essential to enable these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities based on natural language descriptions in long, untrimmed videos. This paper introduces Bayesian-VSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results in the Ego4D Goal-Step dataset, winning the <em>Goal Step</em> challenge at the EgoVis 2024 CVPR. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications. Code is released at <span><span>https://github.com/cplou99/BayesianVSLNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104622"},"PeriodicalIF":3.5,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145927596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extending Large Language Models to multimodality for non-English languages
Pub Date: 2025-12-30, DOI: 10.1016/j.cviu.2025.104618
Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
The growing popularity of Large Vision-Language Models has highlighted and intensified one of the most well-known challenges in the field of Large Language Models: training is mainly, and often exclusively, conducted on English data. Consequently, the resulting models are more prone to error in non-English tasks, and this issue is exacerbated in multimodal settings, which are even more complex and use task-specific datasets. Given this, research on Large Language Models has turned toward adapting them to non-English languages. However, the scarcity of open and curated resources for these languages poses a significant limitation. In this work, we aim to tackle this challenge by exploring the adaptation of Large Vision-Language Models to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how the evaluation of the results is influenced when training a vision-to-text adapter across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research.
{"title":"Extending Large Language Models to multimodality for non-English languages","authors":"Elio Musacchio , Lucia Siciliani , Pierpaolo Basile , Giovanni Semeraro","doi":"10.1016/j.cviu.2025.104618","DOIUrl":"10.1016/j.cviu.2025.104618","url":null,"abstract":"<div><div>The growing popularity of Large Vision-Language Models has highlighted and intensified one of the most well-known challenges in the field of Large Language Models: training is mainly, and most of the time exclusively, conducted on English data. Consequently, the resulting models are more prone to error in non-English tasks, and this issue is exacerbated in multimodal settings that are even more complex and use task-specific datasets. Given this, research on Large Language Models has turned toward adapting them to non-English languages. However, the scarcity of open and curated resources for these languages poses a significant limitation. In this work, we aim to tackle the aforementioned challenge by exploring Large Vision-Language Models adaptation to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how the evaluation of the results is influenced when training a vision-to-text adapter across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104618"},"PeriodicalIF":3.5,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}