
Computer Vision and Image Understanding: Latest Publications

Body shape diversity in the training data and consequences on motion generation
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-06 · DOI: 10.1016/j.cviu.2025.104632
Hugo Rodet, Lama Séoud
Describing human movement is key to many applications, ranging from medicine to 3D animation. Morphology is an important factor influencing how people move, but it is as yet seldom accounted for in human-centric tasks like motion generation. In this study, we first assess the diversity of body shapes in real human motion datasets, then demonstrate the benefits of morphology-aware motion generation. We reveal biases in the data regarding body shape, in particular for body fat and gender representation. Considering the incompleteness of even the largest motion-capture datasets, proving quantitatively that morphology influences motion is difficult using existing tools: we thus propose a new metric relying on 3D body mesh self-collision, and use it to demonstrate that individuals with varied body mass indices also differ in their movements. One consequence is that generic, morphology-agnostic generated poses tend to be unsuitable for the body models they are used with, and we show that this mismatch tends to increase self-collision artifacts. Building upon these results, we show that morphology-aware motion generation reduces mesh self-collision artifacts despite not being trained for it explicitly, even when using a common backbone and a naive conditioning strategy. Morphology-aware generation can also be seamlessly integrated into most pose and motion generation architectures with little-to-no extra computational cost and without compromising generation diversity or realism.
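The naive conditioning strategy mentioned in the abstract can be pictured with a minimal sketch: body-shape parameters (here assumed to be SMPL-style beta coefficients, which the abstract does not specify) are simply concatenated to the generator's latent input. All names and dimensions below are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class ShapeConditionedPoseGenerator(nn.Module):
        # Illustrative sketch of "naive conditioning": SMPL-style shape
        # coefficients (betas) are concatenated to the latent code, so the
        # generated pose can depend on body morphology. Dimensions are
        # placeholders, not the paper's.
        def __init__(self, latent_dim=64, num_betas=10, pose_dim=69):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + num_betas, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, pose_dim),   # e.g. axis-angle body pose
            )

        def forward(self, z, betas):
            return self.net(torch.cat([z, betas], dim=-1))

    gen = ShapeConditionedPoseGenerator()
    poses = gen(torch.randn(4, 64), torch.zeros(4, 10))   # (4, 69)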
Citations: 0
Large Kernel Information-interaction Network for single image super-resolution
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-06 · DOI: 10.1016/j.cviu.2025.104633
Yi Wu, Weiwei Wang, Yujie Wang, Kaige Cui
Single-image super-resolution (SISR) has achieved substantial progress, enabling high-fidelity restoration from low-resolution inputs. However, the high computational cost of existing methods remains a major challenge, limiting their deployment on resource-constrained edge devices. To address this issue, we propose a lightweight Large Kernel Information Interaction Network (LKIN), which effectively balances computational efficiency and reconstruction quality. Our approach integrates multi-scale large receptive fields, information distillation, and attention mechanisms to enhance feature representation and improve super-resolution performance. Specifically, we replace the conventional BSConv with a large kernel network, allowing the model to capture long-range dependencies more effectively while reducing the reliance on deeper architectures. Additionally, we introduce a Multi-Scale Feature Enhancement (MSFE) module, which leverages efficient convolutions and attention mechanisms to refine extracted features while eliminating redundant operations. Extensive experiments are conducted on standard benchmarks (Set5, Set14, BSD100, Urban100, Manga109) at ×2, ×3, and ×4 upscaling factors. We evaluate performance using PSNR and SSIM. Compared with representative lightweight CNN-based methods (e.g., IMDN, BSRN, CARN) and Transformer-based approaches (e.g., SwinIR-light, SRFormer, ESRT, NGSwin), LKIN achieves up to +0.15 dB PSNR improvements over the strongest baseline while reducing parameters by 18%.
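As a rough illustration of the large-kernel idea (a large depthwise kernel for long-range context followed by a pointwise projection, standing in for the BSConv replacement the abstract describes), consider the sketch below. Kernel size, channel count, and the residual layout are assumptions, not LKIN's actual design.

    import torch
    import torch.nn as nn

    class LargeKernelBlock(nn.Module):
        # Illustrative: depthwise large-kernel conv captures long-range
        # dependencies; a pointwise (1x1) conv mixes channels; a residual
        # connection preserves the input. Not the paper's exact module.
        def __init__(self, channels, kernel_size=13):
            super().__init__()
            self.dw = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
            self.pw = nn.Conv2d(channels, channels, 1)
            self.act = nn.GELU()

        def forward(self, x):
            return x + self.pw(self.act(self.dw(x)))  # residual connection

    x = torch.randn(1, 48, 64, 64)
    y = LargeKernelBlock(48)(x)   # same shape as x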
Citations: 0
Causality-inspired multi-grained cross-modal sign language retrieval
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-02 · DOI: 10.1016/j.cviu.2025.104631
Xu-Hua Yang, Dong Wei, Wangjie Li, Hongxiang Hu
Sign language retrieval aims to enhance communication between deaf and hearing individuals. Due to the scarcity of sign language video data, researchers often use contrastive learning-based data augmentation methods to mitigate data sparsity. However, the pairwise metric learning paradigm fails to properly account for differences among various augmentations and may even erroneously learn the distinctions between augmentation methods. Moreover, existing sign language retrieval studies are susceptible to spurious correlations between cross-modal data and often overlook associations across different granularities. To address these limitations, we propose a Causality-Inspired Multi-Grained Cross-Modal Sign Language Retrieval method (CMCM) that enhances cross-modal retrieval capabilities by eliminating both observable and unobservable confounders. First, CMCM performs varying degrees of augmentation on the original videos and employs backdoor adjustment to mitigate confounders among augmented data, obtaining highly stable video representations invariant to confounding factors. Next, we propose a cross-modal causal-attention Gaussian network that employs front-door causal intervention to eliminate implicit confounders and parameterize their Gaussian distribution for fine-grained alignment. Finally, we design a temporal-motion covariance pooling method to capture global features of sign language sequences, facilitating coarse-grained cross-modal feature alignment. Extensive experiments on three public datasets demonstrate that CMCM achieves highly competitive retrieval accuracy. The code is available at: https://github.com/vddong-zjut/CMCM.
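The temporal-motion covariance pooling step can be illustrated with a small sketch that turns frame-level features into a second-order global descriptor. This is only the generic covariance-pooling idea under assumed tensor shapes, not the exact module in CMCM.

    import torch

    def covariance_pooling(features):
        # features: (T, D) frame-level features of one sign-language clip.
        # Returns a (D, D) covariance descriptor capturing how feature
        # dimensions co-vary over time; illustrative of covariance pooling
        # in general, not the paper's exact formulation.
        centered = features - features.mean(dim=0, keepdim=True)
        return centered.T @ centered / (features.shape[0] - 1)

    feats = torch.randn(32, 128)                       # 32 frames, 128-dim features
    global_desc = covariance_pooling(feats).flatten()  # coarse-grained descriptor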
Citations: 0
EventSleep2: Sleep activity recognition on complete night sleep recordings with an event camera
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-02 · DOI: 10.1016/j.cviu.2025.104619
Nerea Gallego, Carlos Plou, Miguel Marcos, Pablo Urcola, Luis Montesano, Eduardo Montijano, Ruben Martinez-Cantin, Ana C. Murillo
Sleep is fundamental to health, and society is increasingly aware of the impact and relevance of sleep disorders. Traditional diagnostic methods, like polysomnography, are intrusive and resource-intensive. Instead, research is focusing on developing novel, less intrusive or portable methods that combine intelligent sensors with activity recognition for diagnosis support and scoring. Event cameras offer a promising alternative for automated, in-home sleep activity recognition due to their excellent low-light performance and low power consumption. This work introduces EventSleep2-data, a significant extension to the EventSleep dataset, featuring 10 complete night recordings (around 7 h each) of volunteers sleeping in their homes. Unlike the original short and controlled recordings, this new dataset captures natural, full-night sleep sessions under realistic conditions. This new data incorporates challenging real-world scene variations, an efficient movement-triggered sparse data recording pipeline, and synchronized 2-channel EEG data for a subset of recordings. We also present EventSleep2-net, a novel event-based sleep activity recognition approach with a dual-head architecture to simultaneously analyze motion classes and static poses. The model is specifically designed to handle the motion-triggered, sparse nature of complete night recordings. Unlike the original EventSleep architecture, EventSleep2-net can predict both movement and static poses even during long periods with no events. We demonstrate state-of-the-art performance on both EventSleep1-data, the original dataset, and EventSleep2-data, with comprehensive ablation studies validating our design decisions. Together, EventSleep2-data and EventSleep2-net overcome the limitations of the previous setup and enable continuous, full-night analysis for real-world sleep monitoring, significantly advancing the potential of event-based vision for sleep disorder studies. Code and data are publicly available on the webpage: https://sites.google.com/unizar.es/eventsleep.
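A minimal sketch of the dual-head idea (a shared encoder with one head for movement classes and one for static poses) is shown below. The encoder, feature dimensions, and class counts are placeholders and do not reflect EventSleep2-net's actual architecture.

    import torch
    import torch.nn as nn

    class DualHeadSleepNet(nn.Module):
        # Illustrative dual-head design: one head for movement classes,
        # one for static pose classes, sharing a common feature encoder.
        # The encoder here is a placeholder MLP over pre-extracted event
        # features, not the paper's network.
        def __init__(self, feat_dim=256, num_motions=8, num_poses=4):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
            self.motion_head = nn.Linear(128, num_motions)
            self.pose_head = nn.Linear(128, num_poses)

        def forward(self, x):
            h = self.encoder(x)
            return self.motion_head(h), self.pose_head(h)

    x = torch.randn(2, 256)
    motion_logits, pose_logits = DualHeadSleepNet()(x)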
Citations: 0
Place recognition for visual assistive localization under challenging visual appearance variations
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-01 · DOI: 10.1016/j.cviu.2025.104623
Ruiqi Cheng, Hai-Miao Hu, Chongze Wang, Xuan Gong
Due to the complexity of real-world environments, self-localization remains a critical yet unresolved challenge for individuals with visual impairments during travel. Visual appearance variations in the context of assistive technology, such as season changes, illumination changes, viewpoint changes, and dynamic occlusions, significantly hinder the performance of place recognition. This paper proposes a novel assistive visual localization method to address these challenges. In order to extract landmark-related features from images with appearance variations, the dual constraints of place classification and feature distillation are proposed based on large-scale place recognition and human matting datasets. Additionally, online sequential matching is employed for place recognition, leveraging temporal consistency embedded in multi-frame sequences to further eliminate erroneous localization results. Evaluated on the large-scale SF-XL dataset augmented with human matting, the proposed image feature model achieves a 3% improvement in Recall@1 compared to state-of-the-art approaches using similar backbone architectures, which indicates the better performance of image retrieval under the assistive occlusion scenarios. More importantly, in real-world validation using self-collected assistive datasets, the proposed visual localization pipeline incorporating sequential matching achieves F1 scores over 0.85 and shows advantages over existing sequential place recognition methods. The implementation codes of the proposed algorithm, along with a real-world testing dataset for assistive localization, are released at https://github.com/chengricky/AssistivePlace.
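The online sequential matching step can be sketched as a SeqSLAM-style consistency check: each reference place is scored by averaging similarities along a constant-velocity diagonal of the query-to-reference similarity matrix. This is an assumed, simplified stand-in for the paper's matcher, not its actual algorithm.

    import numpy as np

    def sequence_score(similarity):
        # similarity: (L, R) cosine similarities between the last L query
        # frames and R reference places. Each reference place r is scored
        # by the mean similarity along the diagonal ending at column r,
        # enforcing temporal consistency (illustrative only).
        L, R = similarity.shape
        scores = np.full(R, -np.inf)
        for r in range(L - 1, R):
            diag = [similarity[q, r - (L - 1) + q] for q in range(L)]
            scores[r] = float(np.mean(diag))
        return int(np.argmax(scores)), scores

    sim = np.random.rand(5, 100)        # 5 recent query frames, 100 places
    best_place, _ = sequence_score(sim)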
Citations: 0
CNNs vs Transformers: Confirmatory factor analysis for eye gaze classification with explainable AI
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-31 · DOI: 10.1016/j.cviu.2025.104624
Naman Goyal, Major Singh Goraya, Tajinder Singh
Recent advancements in eye gaze classification have significant implications for enhancing human robot interaction. Existing benchmark datasets such as UnityEyes often exhibit class imbalance issues, negatively impacting classification efficacy. Addressing this challenge, a balanced dataset, termed reformed-UE, containing 500 images per class across eight distinct gaze directions is introduced. A novel hyperparameter-optimized deep learning model, designated Gimage, is proposed for image-based gaze direction classification. Additionally, the balanced, large-scale MRL dataset enables rigorous generalization testing of the Gimage model. Comparative evaluations involving state-of-the-art models including MobileNetV2, InceptionNetV3, AttentionCNN, MobileViT, Hybrid PCCR and Swin Transformers are conducted. The Gimage model achieves superior performance metrics, registering a validation accuracy of 93.75%, exceeding competing models by approximately 4 to 5 percentage points. Furthermore, Gimage attains higher precision (0.93), recall (0.93), and F1 score (0.93), significantly reducing classification errors, particularly in the challenging TopRight class. Interpretability analyses employing Gradient weighted Class Activation Mapping (GradCAM) heatmaps provide further confirmation of the model’s proficiency in identifying essential latent features critical for accurate classification.
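The Grad-CAM analysis mentioned above follows a standard recipe: capture activations and gradients at the last convolutional stage, weight the activations by spatially averaged gradients, and keep the positive part. The sketch below runs it on an untrained MobileNetV2 backbone as a stand-in for the Gimage model; the backbone and target layer are assumptions for illustration only.

    import torch
    import torch.nn.functional as F
    from torchvision.models import mobilenet_v2

    # Minimal Grad-CAM sketch (illustrative, not the paper's exact setup).
    model = mobilenet_v2()          # untrained stand-in for the gaze classifier
    model.eval()
    acts, grads = {}, {}
    layer = model.features[-1]      # last conv stage
    layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    x = torch.randn(1, 3, 224, 224)
    logits = model(x)
    logits[0, logits.argmax()].backward()          # gradient of the top class

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over gradients
    cam = F.relu((weights * acts["a"].detach()).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]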
Citations: 0
Temporal video segmentation with natural language using text–video cross attention and Bayesian order-priors
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-31 · DOI: 10.1016/j.cviu.2025.104622
Carlos Plou, Lorenzo Mur-Labadia, Jose J. Guerrero, Ruben Martinez-Cantin, Ana C. Murillo
Video is a crucial perception component in both robotics and wearable devices, two key technologies to enable innovative assistive applications, such as navigation and procedure execution assistance tools. Video understanding tasks are essential to enable these systems to interpret and execute complex instructions in real-world environments. One such task is step grounding, which involves identifying the temporal boundaries of activities based on natural language descriptions in long, untrimmed videos. This paper introduces Bayesian-VSLNet, a probabilistic formulation of step grounding that predicts a likelihood distribution over segments and refines it through Bayesian inference with temporal-order priors. These priors disambiguate cyclic and repeated actions that frequently appear in procedural tasks, enabling precise step localization in long videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results in the Ego4D Goal-Step dataset, winning the Goal Step challenge at the EgoVis 2024 CVPR. Furthermore, experiments on additional benchmarks confirm the generality of our approach beyond Ego4D. In addition, we present qualitative results in a real-world robotics scenario, illustrating the potential of this task to improve human–robot interaction in practical applications. Code is released at https://github.com/cplou99/BayesianVSLNet.
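A toy version of the Bayesian refinement with a temporal-order prior: the network's segment likelihoods are multiplied by a prior that favors segments whose position in the video matches the step's relative position in the procedure. The Gaussian prior and its width are assumptions for illustration, not the paper's exact prior.

    import numpy as np

    def posterior_over_segments(likelihood, step_index, num_steps):
        # likelihood: (S,) scores that each of S candidate temporal segments
        # matches the queried step description.
        # Order prior (assumed): steps described earlier in the procedure are
        # expected earlier in the video, modeled as a Gaussian over normalized
        # segment position centered at the step's relative order.
        S = likelihood.shape[0]
        positions = np.linspace(0.0, 1.0, S)
        center = (step_index + 0.5) / num_steps
        prior = np.exp(-0.5 * ((positions - center) / 0.2) ** 2)
        post = likelihood * prior
        return post / post.sum()

    scores = np.random.rand(50)                 # 50 candidate segments
    post = posterior_over_segments(scores, step_index=2, num_steps=10)
    best_segment = int(np.argmax(post))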
Citations: 0
Extending Large Language Models to multimodality for non-English languages
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-30 · DOI: 10.1016/j.cviu.2025.104618
Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
The growing popularity of Large Vision-Language Models has highlighted and intensified one of the most well-known challenges in the field of Large Language Models: training is mainly, and most of the time exclusively, conducted on English data. Consequently, the resulting models are more prone to error in non-English tasks, and this issue is exacerbated in multimodal settings that are even more complex and use task-specific datasets. Given this, research on Large Language Models has turned toward adapting them to non-English languages. However, the scarcity of open and curated resources for these languages poses a significant limitation. In this work, we aim to tackle the aforementioned challenge by exploring Large Vision-Language Models adaptation to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how the evaluation of the results is influenced when training a vision-to-text adapter across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research.
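The vision-to-text adapter discussed here can be sketched as a simple projection from frozen image features into the language model's embedding space, producing a soft visual prefix for the (machine-translated) prompt. Dimensions and token count below are placeholders, not those of the released models.

    import torch
    import torch.nn as nn

    class VisionToTextAdapter(nn.Module):
        # Illustrative adapter: projects pooled vision-encoder features into
        # the token-embedding space of an LLM so image tokens can be prepended
        # to the translated text prompt. All sizes are assumptions.
        def __init__(self, vision_dim=1024, llm_dim=4096, num_tokens=32):
            super().__init__()
            self.num_tokens = num_tokens
            self.proj = nn.Linear(vision_dim, llm_dim * num_tokens)

        def forward(self, vision_feat):                  # (B, vision_dim)
            B = vision_feat.shape[0]
            return self.proj(vision_feat).view(B, self.num_tokens, -1)

    img_feat = torch.randn(2, 1024)                      # pooled image features
    prefix = VisionToTextAdapter()(img_feat)             # (2, 32, 4096) soft prompt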
Citations: 0
SpinVision: An end-to-end volleyball spin estimation with Siamese-based deep classification
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-29 · DOI: 10.1016/j.cviu.2025.104628
Shreya Bansal, Anterpreet Kaur Bedi, Pratibha Kumari, Rishi Kumar Soni, Narayanan C. Krishnan, Mukesh Saini
Accurate spin estimation is a crucial process in detailing ball dynamics, conducting training, and analyzing performance in sports such as volleyball. Traditional methods usually rely on geometric assumptions, handcrafted features, or marker-based estimation, which limits their adaptability to real-world problems. In this paper, we propose a novel spin estimation framework, namely SpinVision, which treats spin estimation as a soft-classification problem. The deep learning model employs Gaussian soft labels and a Kullback–Leibler Divergence (KLD) loss. Further, it employs fusion methods alongside squeeze-and-excitation blocks and residual connections, which help achieve distinctive representations without the support of external markers or registration procedures. In addition, the inclusion of transfer learning helps generalize the model effectively to real-world problems, such as estimating the spin of a volleyball. Compared with hard-classification or regression-based methods, the proposed model yields more reliable and smoother predictions, highlighting it as a more accurate, robust, and practical solution for spin prediction in sports analytics and related applications.
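The Gaussian soft-label plus KL-divergence formulation can be sketched as follows: each discretized spin target is smeared into a Gaussian bump over neighboring classes, and the model is trained with KL divergence between its softmax output and that soft target. The bin count and sigma below are illustrative assumptions, not the paper's settings.

    import torch
    import torch.nn.functional as F

    def gaussian_soft_labels(target_bins, num_classes, sigma=1.0):
        # Spin values are assumed discretized into ordered classes; each
        # target becomes a Gaussian bump over neighboring classes instead of
        # a one-hot vector, so near-miss predictions are penalized less.
        classes = torch.arange(num_classes, dtype=torch.float32)
        dist = (classes.unsqueeze(0) - target_bins.unsqueeze(1).float()) ** 2
        soft = torch.exp(-dist / (2 * sigma ** 2))
        return soft / soft.sum(dim=1, keepdim=True)

    logits = torch.randn(4, 20)                      # 20 spin-rate bins
    targets = torch.tensor([3, 7, 7, 15])
    soft = gaussian_soft_labels(targets, num_classes=20)
    loss = F.kl_div(F.log_softmax(logits, dim=1), soft, reduction="batchmean")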
Citations: 0
TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos
IF 3.5 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2025-12-29 · DOI: 10.1016/j.cviu.2025.104613
Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.
Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
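The mismatch-based detection rule can be summarized in a few lines: after each recognized action token, query the anticipation branch for the expected next token and flag a mistake on disagreement. The predict_next callable below is a hypothetical stand-in for the LLM-based anticipation branch, not the paper's implementation.

    def detect_mistake(recognized_tokens, predict_next):
        # Illustrative online loop: for each newly recognized action token,
        # compare it with what the anticipation module predicts should come
        # next given the token history; a mismatch flags a procedural mistake.
        history = []
        for token in recognized_tokens:
            if history:
                expected = predict_next(history)
                if expected != token:
                    return {"mistake": True, "step": len(history),
                            "expected": expected, "observed": token}
            history.append(token)
        return {"mistake": False}

    # Toy anticipation stub standing in for the LLM branch.
    procedure = ["open", "pour", "stir", "close"]
    predict_next = lambda hist: procedure[len(hist)] if len(hist) < len(procedure) else None
    print(detect_mistake(["open", "pour", "close"], predict_next))   # flags step 2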
Citations: 0