
Computer Vision and Image Understanding: Latest Publications

RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman, Letizia Gionfrida, Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
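The tread/riser estimation step lends itself to a compact illustration. The sketch below is not the authors' pipeline: it assumes a pre-segmented stair point cloud in metres, with z up and y pointing away from the camera, and recovers average riser height and tread depth by clustering point heights into step surfaces. The gap threshold and frame conventions are assumptions for the example.

```python
import numpy as np

def estimate_stair_dimensions(points, level_gap=0.05):
    """points: (N, 3) array of stair points; returns (riser, tread) in metres."""
    z = np.sort(points[:, 2])
    # A jump larger than level_gap between consecutive sorted heights starts
    # a new step surface; each resulting cluster is one tread.
    breaks = np.where(np.diff(z) > level_gap)[0] + 1
    levels = [cluster.mean() for cluster in np.split(z, breaks)]
    risers = np.diff(levels)  # vertical spacing between surfaces
    # Median forward position of the points on each surface; consecutive
    # differences approximate the tread depth.
    fronts = [np.median(points[np.abs(points[:, 2] - lvl) < level_gap, 1])
              for lvl in levels]
    return float(np.mean(risers)), float(np.mean(np.diff(fronts)))

# Toy staircase: four steps, 0.17 m risers, 0.29 m treads, 5 mm sensor noise.
rng = np.random.default_rng(0)
steps = []
for i in range(4):
    y = rng.uniform(i * 0.29, (i + 1) * 0.29, 500)
    z = i * 0.17 + rng.normal(0.0, 0.005, 500)
    steps.append(np.column_stack([rng.uniform(-0.5, 0.5, 500), y, z]))
riser, tread = estimate_stair_dimensions(np.vstack(steps))
print(f"riser ~ {riser:.3f} m, tread ~ {tread:.3f} m")
```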
{"title":"RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support","authors":"Edgar R. Guzman ,&nbsp;Letizia Gionfrida ,&nbsp;Robert D. Howe","doi":"10.1016/j.cviu.2025.104621","DOIUrl":"10.1016/j.cviu.2025.104621","url":null,"abstract":"<div><div>This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of <span><math><mrow><mn>5</mn><mo>.</mo><mn>77</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span>, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of <span><math><mrow><mn>1</mn><mo>.</mo><mn>20</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>49</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>35</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>45</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for ascending stairs, and <span><math><mrow><mn>1</mn><mo>.</mo><mn>28</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>55</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>47</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>65</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. 
These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104621"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145847576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Body shape diversity in the training data and consequences on motion generation
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.cviu.2025.104632
Hugo Rodet, Lama Séoud
Describing human movement is key to many applications, ranging from medicine to 3D animation. Morphology is an important factor influencing how people move, but as yet it is seldom accounted for in human-centric tasks like motion generation. In this study, we first assess the diversity of body shapes in real human motion datasets, then demonstrate the benefits of morphology-aware motion generation. We reveal biases in the data regarding body shape, in particular for body fat and gender representation. Considering the incompleteness of even the largest motion-capture datasets, proving quantitatively that morphology influences motion is difficult using existing tools: we thus propose a new metric relying on 3D body mesh self-collision, and use it to demonstrate that individuals with varied body mass indices also differ in their movements. One consequence is that generic, morphology-agnostic generated poses tend to be unsuitable for the body models they are used with, and we show that this tends to increase self-collision artifacts. Building upon these results, we show that morphology-aware motion generation reduces mesh self-collision artifacts despite not being trained for it explicitly, even when using a common backbone and a naive conditioning strategy. Morphology-aware generation can also be seamlessly integrated into most pose and motion generation architectures with little to no extra computational cost and without compromising generation diversity or realism.
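As a rough illustration of the self-collision idea (the paper's metric operates on full 3D body meshes, e.g. parametric body models, which this toy version does not reproduce), the function below counts vertex pairs from different labelled body segments that come closer than a contact threshold. All names and numbers are illustrative.

```python
import numpy as np

def self_collision_score(vertices, segment_ids, threshold=0.01):
    """vertices: (N, 3) positions; segment_ids: (N,) integer part labels."""
    d = np.linalg.norm(vertices[:, None, :] - vertices[None, :, :], axis=-1)
    cross_part = segment_ids[:, None] != segment_ids[None, :]
    # The symmetric distance matrix counts every colliding pair twice.
    return int(((d < threshold) & cross_part).sum() // 2)

# Two overlapping toy "segments" standing in for, say, an arm and the torso.
rng = np.random.default_rng(1)
arm = rng.normal([0.00, 0.0, 0.0], 0.02, (200, 3))
torso = rng.normal([0.01, 0.0, 0.0], 0.02, (200, 3))
verts = np.vstack([arm, torso])
labels = np.repeat([0, 1], 200)
print(self_collision_score(verts, labels))
```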
{"title":"Body shape diversity in the training data and consequences on motion generation","authors":"Hugo Rodet,&nbsp;Lama Séoud","doi":"10.1016/j.cviu.2025.104632","DOIUrl":"10.1016/j.cviu.2025.104632","url":null,"abstract":"<div><div>Describing human movement is key to many applications, ranging from medicine to 3D animation. Morphology is an important factor influencing how people move, but as of yet it is seldom accounted for in human-centric tasks like motion generation. In this study, we first assess the diversity of body shapes in real human motion datasets, then demonstrate the benefits of morphology-aware motion generation. We reveal biases in the data regarding body shape, in particular for body fat and gender representation. Considering the incompleteness of even the largest motion-capture datasets, proving quantitatively that morphology influences motion is difficult using existing tools: we thus propose a new metric relying on 3D body mesh self-collision, and use it to demonstrate that individuals with varied body mass indices also differ in their movements. One consequence is that generic, morphology-agnostic generated poses tend to be unsuitable for the body models they are used with, and we show that it tends to increase self-collision artifacts. Building upon these results, we show that morphology-aware motion generation reduces mesh self-collision artifacts despite not being trained for it explicitly, even when using a common backbone and a naive conditioning strategy. Morphology-aware generation can also be seamlessly integrated to most pose and motion generation architectures with little-to-no extra computational cost and without compromising generation diversity of realism.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104632"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SpinVision: An end-to-end volleyball spin estimation with Siamese-based deep classification
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-29 | DOI: 10.1016/j.cviu.2025.104628
Shreya Bansal, Anterpreet Kaur Bedi, Pratibha Kumari, Rishi Kumar Soni, Narayanan C. Krishnan, Mukesh Saini
Accurate spin estimation is crucial for characterizing ball dynamics, conducting training, and analyzing performance in sports such as volleyball. Traditional methods usually rely on geometric assumptions, handcrafted features, or marker-based estimation, which limits their adaptability to real-world problems. In this paper, we propose a novel spin estimation framework, SpinVision, which treats spin estimation as a soft-classification problem. The deep learning model employs Gaussian soft labels and a Kullback–Leibler Divergence (KLD) loss. Further, it employs fusion methods alongside squeeze-and-excitation blocks and residual connections, which helps achieve distinctive representations without the support of external markers or registration procedures. The inclusion of transfer learning also helps the model generalize effectively to real-world problems, such as estimating the spin of a volleyball. Compared with hard-classification or regression-based methods, the proposed model yields more reliable and smoother predictions, making it a more accurate, robust, and practical solution for spin prediction in sports analytics and related applications.
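The soft-classification formulation is easy to make concrete. Assuming spin is discretized into bins (the bin count and sigma below are illustrative, not the paper's values), a Gaussian soft label spreads probability mass around the true bin, and training minimizes the KL divergence between this target and the model's softmax output:

```python
import numpy as np

def gaussian_soft_labels(true_class, n_classes, sigma=1.0):
    # Gaussian bump centred on the true bin, normalized to sum to one.
    k = np.arange(n_classes)
    w = np.exp(-0.5 * ((k - true_class) / sigma) ** 2)
    return w / w.sum()

def kld_loss(logits, target, eps=1e-12):
    # KL(target || softmax(logits)), the training objective for soft labels.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(np.sum(target * (np.log(target + eps) - np.log(p + eps))))

target = gaussian_soft_labels(true_class=3, n_classes=8)
logits = np.array([0.1, 0.3, 1.2, 2.5, 1.1, 0.2, 0.0, -0.5])
print(target.round(3), kld_loss(logits, target))
```

Unlike a one-hot target, the Gaussian target penalizes a prediction one bin away far less than one five bins away, which is what makes the resulting predictions smoother.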
{"title":"SpinVision: An end-to-end volleyball spin estimation with Siamese-based deep classification","authors":"Shreya Bansal ,&nbsp;Anterpreet Kaur Bedi ,&nbsp;Pratibha Kumari ,&nbsp;Rishi Kumar Soni ,&nbsp;Narayanan C. Krishnan ,&nbsp;Mukesh Saini","doi":"10.1016/j.cviu.2025.104628","DOIUrl":"10.1016/j.cviu.2025.104628","url":null,"abstract":"<div><div>Accurate spin estimation is a very crucial process in detailing ball dynamics, conducting trainings and analyzing performance in different sports such as volleyball. Traditional methods usually rely on geometric assumptions, handcrafted features, or marker based estimation, which leads to their limited adaptability to real-world problems. In this paper, we propose a novel spin estimation framework, namely SpinVision, considering it as a soft-classification problem. The deep learning model employs Gaussian soft labels and Kullback–Leibler Divergence (KLD) loss. Further, it employs fusion methods alongside squeeze-and-excitation blocks and residual connections, which helps in achieving distinctive representations without the support of external markers or registration procedures. Also, inclusion of transfer learning helps generalizing the model effectively to real-world problems, such as estimating the spin of a volleyball. When compared with the hard-classification or regression-based methods, the proposed model results in more reliable and smooth predictions, thus highlighting it as more accurate, robust, and practical solution for spin prediction in sports analytics and related applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104628"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SASTD: Stepwise attention style transfer network based on diffusion models
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104612
Zhuoya Wang, Gui Chen, Yaxin Li, Yongsheng Dong
Image style transfer techniques have advanced significantly, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes information entanglement within the generated stylized result. To alleviate this issue, we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD outperforms both image style transfer methods and style-guided text-to-image synthesis methods in qualitative and quantitative comparisons.
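AdaIN itself is a fixed, well-known operation (Huang and Belongie, 2017): it re-normalizes content features so their channel-wise mean and standard deviation match those of the style features. How SASTD wires this into latent-noise initialization is specific to the paper, so treat the NumPy version below only as a reference for the operation.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """content, style: (C, H, W) feature maps; returns re-normalized content."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_sd = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_sd = style.std(axis=(1, 2), keepdims=True)
    # Whiten content statistics, then re-colour with style statistics.
    return s_sd * (content - c_mu) / (c_sd + eps) + s_mu

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, (3, 8, 8))
style = rng.normal(2.0, 0.5, (3, 8, 8))
out = adain(content, style)
# Output statistics now track the style statistics per channel (~2.0, ~0.5).
print(out.mean(axis=(1, 2)).round(2), out.std(axis=(1, 2)).round(2))
```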
{"title":"SASTD: Stepwise attention style transfer network based on diffusion models","authors":"Zhuoya Wang,&nbsp;Gui Chen,&nbsp;Yaxin Li,&nbsp;Yongsheng Dong","doi":"10.1016/j.cviu.2025.104612","DOIUrl":"10.1016/j.cviu.2025.104612","url":null,"abstract":"<div><div>Image style transfer techniques have significantly advanced, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes the problem of information entanglement within the generated stylized result. To alleviate this issue, in this paper we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the initial latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD achieves better performance in both qualitative and quantitative comparisons compared to both image style transfer methods and style-guided text-to-image synthesis methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104612"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhanced approach with edge feature guidance for LiDAR signal denoising
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-12 | DOI: 10.1016/j.cviu.2025.104609
A. Anigo Merjora, P. Sardar Maran
This research addresses the challenging task of denoising Light Detection and Ranging (LiDAR) signals, with a specific focus on Rayleigh backscatter signals, which are particularly vulnerable to noise due to atmospheric interference and sensor limitations. We propose an enhanced shearlet wavelet U-Net with edge feature guidance, a novel deep learning framework that integrates the multi-directional, multi-scale decomposition capabilities of the shearlet wavelet transform with the powerful feature extraction and localization properties of the U-Net architecture. A key contribution of this research is the introduction of an edge feature guidance module within the U-Net, designed to preserve critical structural and edge details typically lost during denoising. The denoising process uses the shearlet transform to decompose the noisy input signal into different scales and orientations. This allows the model to better distinguish noise from signal and, more importantly, to make this distinction across resolutions. The experimental assessments applied the proposed method to both synthetic and real-world atmospheric LiDAR datasets and compared it with a number of cutting-edge denoising methods, including classical wavelet-based denoising methods as well as supervised deep-learning methods. Quantitative results indicate that our model produces, on average, a 28% higher signal-to-noise ratio (SNR) and a 31% greater improvement in mean squared error (MSE) than baseline methods. Qualitative analysis shows that the proposed model retains small-scale atmospheric structures and edge continuity. Overall, the results indicate that the proposed method is effective for improving LiDAR signal quality across a wide range of applications in environmental monitoring and meteorology, where signal fidelity is critical.
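For context, the classical wavelet baseline the paper compares against can be sketched in a few lines: a one-level Haar decomposition, soft thresholding of the detail coefficients, reconstruction, and the SNR metric used in the evaluation. This is a baseline illustration only, not the shearlet wavelet U-Net; the decay profile and threshold below are invented for the example.

```python
import numpy as np

def haar_denoise(signal, threshold):
    x = signal[: len(signal) // 2 * 2]            # force even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)     # low-pass coefficients
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)     # high-pass coefficients
    # Soft thresholding: shrink small (mostly noise) detail coefficients.
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    out = np.empty_like(x)
    out[0::2] = (approx + detail) / np.sqrt(2)    # inverse Haar transform
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

def snr_db(clean, estimate):
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - estimate) ** 2))

t = np.linspace(0.0, 1.0, 1024)
clean = np.exp(-5.0 * t)                          # idealized backscatter decay
noisy = clean + np.random.default_rng(0).normal(0.0, 0.05, t.size)
denoised = haar_denoise(noisy, threshold=0.05)
print(f"SNR noisy {snr_db(clean, noisy):.1f} dB -> denoised {snr_db(clean, denoised):.1f} dB")
```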
{"title":"Enhanced approach with edge feature guidance for LiDAR signal denoising","authors":"A. Anigo Merjora ,&nbsp;P. Sardar Maran","doi":"10.1016/j.cviu.2025.104609","DOIUrl":"10.1016/j.cviu.2025.104609","url":null,"abstract":"<div><div>This research addresses the challenging task of denoising Light Detection and Ranging (LiDAR) signals, with a specific focus on Rayleigh backscatter signals that are particularly vulnerable to noise due to atmospheric interference and sensor limitations. An enhanced shearlet wavelet U-Net with edge feature guidance is proposed, which is a novel deep learning framework that integrates the multi-directional, multi-scale decomposition capabilities of the Shearlet wavelet transform with the powerful feature extraction and localization properties of the U-Net architecture. A key contribution of this research is the introduction of an edge feature guidance module within the U-Net, designed to preserve critical structural and edge details typically lost during denoising. The denoising process uses the Shearlet transform to decompose the noisy input signal into different scales and orientations. This allows the model to better identify the difference between noise and signal, and, more importantly, make this differentiation based on resolutions. The experimental assessments applied the suggested method to both synthetic and real-world atmospheric LiDAR datasets and compared it with a number of cutting-edge denoising methods, specifically classical wavelet-based denoising methods, as well as supervised deep-learning methods. Quantitative results indicate that our model produces a 28% higher average signal-to-noise ratio (SNR) and a 31% higher mean squared error (MSE) improvement on average than baseline methods. Qualitative analysis shows the proposed model continues to retain small-scale atmospheric structures and edge continuity. Overall, the results indicate that the proposed method is effective for improving LiDAR signal quality implemented for a wide range of applications in environmental monitoring and meteorology, and signal fidelity is critical.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104609"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146038534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Memory-enriched thought-by-thought framework for complex Diagram Question Answering
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-26 | DOI: 10.1016/j.cviu.2025.104608
Xinyu Zhang, Lingling Zhang, Yanrui Wu, Shaowei Wang, Wenjun Wu, Muye Huang, Qianying Wang, Jun Liu
Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step: they fail to capture and represent key information during reasoning, ignore critical changes as reasoning unfolds, and do not reflect the complex, dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop METbT-Bert, METbT-T5, METbT-Qwen and METbT-InternLM, which use Bert, T5, Qwen and InternLM, respectively, as the foundational language models within our framework. Experiments are conducted on multiple datasets including Smart-101, ScienceQA, and IconQA, and in all cases the results surpass those of the corresponding base language models. The results demonstrate that our METbT framework offers superior scalability and robustness.
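To make the "memory and operators" idea concrete, here is a deliberately small sketch: a memory that stores one vector per reasoning step and is read back with dot-product attention, plus a few candidate fusion operators for visual and textual representations. The attention read-out and this particular operator set are assumptions for illustration; the paper's METbT framework does not reduce to this code.

```python
import numpy as np

class ReasoningMemory:
    """Stores one vector per reasoning step so later steps can re-read
    earlier intermediate representations."""
    def __init__(self):
        self.steps = []

    def write(self, vec):
        self.steps.append(np.asarray(vec, dtype=float))

    def read(self, query):
        M = np.stack(self.steps)
        s = M @ query                      # similarity to each stored step
        w = np.exp(s - s.max())
        w /= w.sum()                       # softmax attention weights
        return w @ M                       # weighted recall of past steps

def fuse(visual, textual, op="add"):
    """Candidate operators for merging modality representations."""
    if op == "add":
        return visual + textual
    if op == "hadamard":
        return visual * textual
    if op == "concat":
        return np.concatenate([visual, textual])
    raise ValueError(f"unknown operator: {op}")

mem = ReasoningMemory()
mem.write([1.0, 0.0])
mem.write([0.0, 1.0])
print(mem.read(np.array([0.9, 0.1])))      # recalls mostly the first step
print(fuse(np.ones(2), np.arange(2.0), op="hadamard"))
```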
{"title":"Memory-enriched thought-by-thought framework for complex Diagram Question Answering","authors":"Xinyu Zhang ,&nbsp;Lingling Zhang ,&nbsp;Yanrui Wu ,&nbsp;Shaowei Wang ,&nbsp;Wenjun Wu ,&nbsp;Muye Huang ,&nbsp;Qianying Wang ,&nbsp;Jun Liu","doi":"10.1016/j.cviu.2025.104608","DOIUrl":"10.1016/j.cviu.2025.104608","url":null,"abstract":"<div><div>Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step, lacking the capture and representation of key information during the reasoning process, ignoring critical changes in the reasoning process, and failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called <strong>M</strong>emory-<strong>E</strong>nriched <strong>T</strong>hought-by-<strong>T</strong>hought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop the METbT-Bert, METbT-T5, METbT-Qwen and METbT-InternLM, leveraging Bert, T5, Qwen and InternLM as the foundational language models with our framework, respectively. Experiments are conducted on multiple datasets including Smart-101, ScienceQA, and IconQA, and in all cases, the results surpassed those of the same language models. The results demonstrate that our METbT framework offers superior scalability and robustness.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104608"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CNNs vs Transformers: Confirmatory factor analysis for eye gaze classification with explainable AI
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-31 | DOI: 10.1016/j.cviu.2025.104624
Naman Goyal, Major Singh Goraya, Tajinder Singh
Recent advancements in eye gaze classification have significant implications for enhancing human–robot interaction. Existing benchmark datasets such as UnityEyes often exhibit class imbalance issues, negatively impacting classification efficacy. Addressing this challenge, a balanced dataset, termed reformed-UE, containing 500 images per class across eight distinct gaze directions, is introduced. A novel hyperparameter-optimized deep learning model, designated Gimage, is proposed for image-based gaze direction classification. Additionally, the balanced, large-scale MRL dataset enables rigorous generalization testing of the Gimage model. Comparative evaluations involving state-of-the-art models including MobileNetV2, InceptionNetV3, AttentionCNN, MobileViT, Hybrid PCCR and Swin Transformers are conducted. The Gimage model achieves superior performance metrics, registering a validation accuracy of 93.75% and exceeding competing models by approximately 4 to 5 percentage points. Furthermore, Gimage attains higher precision (0.93), recall (0.93), and F1-score (0.93), significantly reducing classification errors, particularly in the challenging TopRight class. Interpretability analyses employing Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps further confirm the model’s proficiency in identifying the essential latent features critical for accurate classification.
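The reported precision, recall, and F1 values are standard macro-averaged statistics. For reference, here is how they fall out of a confusion matrix; this is a generic sketch, unrelated to the Gimage implementation itself, and the matrix below is a made-up two-class example.

```python
import numpy as np

def prf_from_confusion(cm):
    """cm[i, j] = count of samples with true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)   # per predicted class
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)      # per true class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    # Macro averaging: unweighted mean over classes.
    return precision.mean(), recall.mean(), f1.mean()

cm = [[48, 2], [5, 45]]   # toy two-class gaze confusion matrix
print([round(float(m), 3) for m in prf_from_confusion(cm)])
```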
{"title":"CNNs vs Transformers: Confirmatory factor analysis for eye gaze classification with explainable AI","authors":"Naman Goyal,&nbsp;Major Singh Goraya,&nbsp;Tajinder Singh","doi":"10.1016/j.cviu.2025.104624","DOIUrl":"10.1016/j.cviu.2025.104624","url":null,"abstract":"<div><div>Recent advancements in eye gaze classification have significant implications for enhancing human robot interaction. Existing benchmark datasets such as UnityEyes often exhibit class imbalance issues, negatively impacting classification efficacy. Addressing this challenge, a balanced dataset, termed reformed-UE, containing 500 images per class across eight distinct gaze directions is introduced. A novel hyperparameter optimized deep learning model, designated <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span>, is proposed for image-based gaze direction classification. Additionally, the balanced, large-scale MRL dataset enables rigorous generalization testing of the <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span> model. Comparative evaluations involving state of the art models including MobileNetV2, InceptionNetV3, AttentionCNN, MobileViT, Hybrid PCCR and Swin Transformers are conducted. The <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span> model achieves superior performance metrics, registering a validation accuracy of 93.75%, exceeding competing models by approximately 4 to 5 percentage points. Furthermore, <span><math><msub><mrow><mi>G</mi></mrow><mrow><mi>image</mi></mrow></msub></math></span> attains higher precision (0.93), recall (0.93), and F1score (0.93), significantly reducing classification errors, particularly in the challenging TopRight class. Interpretability analyses employing Gradient weighted Class Activation Mapping (GradCAM) heatmaps provide further confirmation of the model’s proficiency in identifying essential latent features critical for accurate classification.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104624"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AuxFlow: Anchor-grounded homography estimation through flow-guided auxiliary points for Soccer field registration and player localization
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2026-01-21 | DOI: 10.1016/j.cviu.2026.104662
Julian Ziegler, Daniel Matthes, Patrick Frenzel, Mirco Fuchs
We introduce AuxFlow, a novel, temporally-aware pipeline for homography estimation and field registration in challenging football broadcast footage. To overcome the temporal instability and high performance variance of existing per-frame keypoint methods, our AuxFlow approach combines a robust frame-wise keypoint model with a temporal propagation strategy. The system automatically identifies high-confidence “anchor” frames, where it estimates the homography solely from the keypoint model, before sampling auxiliary points, which are re-identified in neighbouring frames using optical flow to establish dense, coherent correspondences across the sequence. This significantly enhances the stability and accuracy of the estimated homographies. Our evaluation on the SoccerNet GSR dataset shows consistent, measurable improvements in robustness and smoothness over the existing state of the art, enabling highly reliable player localization that is invaluable for downstream applications.
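The geometric core of fitting a homography to the flow-propagated auxiliary correspondences is the classical Direct Linear Transform (DLT). The sketch below is the textbook fit only, without the paper's anchor selection, optical-flow tracking, or robust estimation, so treat it as a reference for the fitting step rather than the AuxFlow pipeline.

```python
import numpy as np

def dlt_homography(src, dst):
    """src, dst: (N, 2) matched points, N >= 4; returns 3x3 H with dst ~ H src."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Two rows per correspondence from the cross-product constraint.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)            # null vector = flattened homography
    return H / H[2, 2]

# Round-trip check on a synthetic homography and five field points.
H_true = np.array([[1.1, 0.02, 5.0], [0.01, 0.95, -3.0], [1e-4, 2e-4, 1.0]])
src = np.array([[0, 0], [100, 0], [0, 60], [100, 60], [50, 30]], dtype=float)
h = (H_true @ np.c_[src, np.ones(len(src))].T).T
dst = h[:, :2] / h[:, 2:]
print(np.allclose(dlt_homography(src, dst), H_true, atol=1e-6))
```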
{"title":"AuxFlow: Anchor-grounded homography estimation through flow-guided auxiliary points for Soccer field registration and player localization","authors":"Julian Ziegler,&nbsp;Daniel Matthes,&nbsp;Patrick Frenzel,&nbsp;Mirco Fuchs","doi":"10.1016/j.cviu.2026.104662","DOIUrl":"10.1016/j.cviu.2026.104662","url":null,"abstract":"<div><div>We introduce AuxFlow, a novel, temporally-aware pipeline for homography estimation and field registration in challenging football broadcast footage. To overcome the temporal instability and high performance variance of existing per-frame keypoint methods, our AuxFlow approach combines a robust frame-wise keypoint model with a temporal propagation strategy. The system automatically identifies high-confidence ”anchor” frames where it estimates the homography solely based on the keypoint model, before sampling auxiliary points, which are re-identified in neighbouring frames using optical flow to establish dense, coherent correspondences across the sequence. This significantly enhances the stability and accuracy of the estimated homographies. Our evaluation on the SoccerNet GSR dataset shows consistent, measurable improvements in robustness and smoothness over existing State-of-the-Art, enabling highly reliable player localization invaluable for downstream applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104662"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146188637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
BiPG-FER: Bi-intelligence probabilistic graph for facial expression inference driven by action units
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2026-01-14 | DOI: 10.1016/j.cviu.2026.104655
Fei Wan, Ruicong Zhi
Investigating the associations between facial action units (AUs) and emotions (EMOs) helps to eliminate the constraints imposed by predefined emotion patterns. However, accurately modeling the soft, probabilistic AU–EMO relationships and inferring emotional states from AU sequences remains a challenging task. To address this, we propose a Bi-intelligence Probabilistic Graph model (BiPG-FER), which flexibly learns interpretable AU–AU and AU–EMO associations and enables automatic facial expression inference from AU sequences. In the input phase, a small portion of external prior knowledge is incorporated to mitigate the high-entropy fluctuations often caused by random initialization. We construct a two-layer fully connected AU–EMO association graph and develop an end-to-end architecture with a masking mechanism that dynamically updates the AU–AU and AU–EMO relationships by computing joint probabilities. An oversampling strategy, combined with adaptive thresholding and a data-contribution-aware reweighting scheme, is introduced to address the skewed distribution of emotion labels. Finally, we design a strategy that preserves previous model weights and generates pseudo-samples based on the top-k conditional AU–EMO probabilities, allowing the model to evolve smoothly in a continuously changing and heterogeneous data stream. Experimental results demonstrate that the proposed BiPG-FER effectively produces interpretable probabilistic associations while improving recognition performance on both micro-expression and macro-expression datasets.
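A naive illustration of probabilistic AU-to-emotion inference: weight an assumed P(emotion | AU) table by the observed AU activation probabilities. The actual BiPG-FER graph, masking mechanism, and joint-probability updates are far richer than this; every number below is invented for the example.

```python
import numpy as np

def infer_emotion(au_probs, au_emo, eps=1e-12):
    """au_probs: (A,) activation probability per AU;
    au_emo: (A, E) assumed P(emotion | AU active) table."""
    scores = au_probs @ au_emo            # weighted-sum association
    return scores / max(scores.sum(), eps)

au_probs = np.array([0.9, 0.1, 0.7])      # detected AU activations
au_emo = np.array([[0.6, 0.3, 0.1],       # hypothetical AU->EMO table
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6]])
print(infer_emotion(au_probs, au_emo).round(3))
```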
{"title":"BiPG-FER: Bi-intelligence probabilistic graph for facial expression inference drived by action units","authors":"Fei Wan,&nbsp;Ruicong Zhi","doi":"10.1016/j.cviu.2026.104655","DOIUrl":"10.1016/j.cviu.2026.104655","url":null,"abstract":"<div><div>Investigating the associations between facial action units (AUs) and emotions (EMOs) helps to eliminate the constraints imposed by predefined emotion patterns. However, accurately modeling the soft, probabilistic AU–EMO relationships and inferring emotional states from AU sequences remains a challenging task. To address this, we propose a Bi-intelligence Probabilistic Graph model (BiPG-FER), which flexibly learns interpretable AU–AU and AU–EMO associations and enables automatic facial expression inference from AU sequences. In the input phase, a small portion of external prior knowledge is incorporated to mitigate the high-entropy fluctuations often caused by random initialization. We construct a two-layer fully connected AU–EMO association graph and develop an end-to-end architecture with a masking mechanism that dynamically updates the AU–AU and AU–EMO relationships by computing joint probabilities. An oversampling strategy, combined with a adaptive thresholding and a data-contribution-aware reweighting scheme, is introduced to address the skewed post-distribution of emotion labels. Finally, we design a strategy that preserves previous model weights and generates pseudo-samples based on the top-k conditional AU–EMO probabilities, allowing the model to evolve smoothly in a continuously changing and heterogeneous data stream. Experimental results demonstrate that the proposed BiPG-FER effectively produces interpretable probabilistic associations while improving recognition performance on both micro-expression and macro-expression datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104655"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145978916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos
IF 3.5 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-29 | DOI: 10.1016/j.cviu.2025.104613
Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.
Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
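The mistake-detection rule itself is simple to state in code: flag a step whenever the action token recognized from the video disagrees with the token forecast by the anticipation branch. The action tokens below are invented for the example; producing them from egocentric video and an LLM is the hard part the paper addresses.

```python
def detect_mistakes(recognized, anticipated):
    """Indices where the recognition and anticipation branches disagree."""
    return [i for i, (r, a) in enumerate(zip(recognized, anticipated)) if r != a]

recognized  = ["take-bowl", "pour-water", "crack-egg", "stir"]
anticipated = ["take-bowl", "pour-water", "add-flour", "stir"]
print(detect_mistakes(recognized, anticipated))   # -> [2]
```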
{"title":"TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos","authors":"Leonardo Plini ,&nbsp;Luca Scofano ,&nbsp;Edoardo De Matteis ,&nbsp;Guido Maria D’Amely di Melendugno ,&nbsp;Alessandro Flaborea ,&nbsp;Andrea Sanchietti ,&nbsp;Giovanni Maria Farinella ,&nbsp;Fabio Galasso ,&nbsp;Antonino Furnari","doi":"10.1016/j.cviu.2025.104613","DOIUrl":"10.1016/j.cviu.2025.104613","url":null,"abstract":"<div><div>Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.</div><div>Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104613"},"PeriodicalIF":3.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0