
Computer Vision and Image Understanding: Latest Publications

TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-29 | DOI: 10.1016/j.cviu.2025.104613
Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.
Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
Citations: 0
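The core of the approach is the mismatch rule between the two branches: a mistake is flagged when the recognized action disagrees with the anticipated one. Below is a minimal sketch of that rule with placeholder `recognize` and `anticipate` functions; it is an illustration of the idea only, not the authors' implementation, which uses an LLM with Chain of Thought and In-Context Learning for anticipation.

```python
# Minimal sketch of the dual-branch mismatch rule for online mistake detection.
# recognize() and anticipate() are placeholder stand-ins, not the paper's branches.
from typing import List


def recognize(frame_labels: List[str]) -> str:
    """Stand-in recognition branch: aggregate frame-level labels into one
    action token (here, a simple majority vote over the current window)."""
    return max(set(frame_labels), key=frame_labels.count)


def anticipate(history: List[str]) -> str:
    """Stand-in anticipation branch: predict the next action token from the
    previously predicted ones (here, repeat the transition last seen after
    the current token; the paper instead queries an LLM)."""
    if not history:
        return "unknown"
    current = history[-1]
    for prev, nxt in zip(reversed(history[:-1]), reversed(history[1:])):
        if prev == current:
            return nxt
    return current


def is_mistake(history: List[str], frame_labels: List[str]) -> bool:
    """Online mistake detection: mismatch between the two branches."""
    return recognize(frame_labels) != anticipate(history)


# Example: action tokens predicted so far, plus frame-level labels for the new step
print(is_mistake(["pour", "stir", "pour"], ["stir", "stir", "pour"]))
```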
Multimodal transformer–diffusion framework for large-scale reconstruction of soccer tracking data
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-29 | DOI: 10.1016/j.cviu.2025.104626
Harry Hughes, Patrick Lucey, Michael Horton, Harshala Gammulle, Clinton Fookes, Sridha Sridharan
In soccer, tracking data (player and ball locations over time) is central to performance analysis and a major focus of computer vision in sport. Tracking from broadcast or single-view video offers scalable coverage across all professional matches but suffers from frequent occlusions and missing information. Existing academic work typically evaluates short clips under simplified conditions, whereas industrial applications require complete, game-level coverage. We address these challenges with a multimodal transformer–diffusion framework that combines human-in-the-loop event supervision with single-view video. Our approach first leverages long-term multimodal context — tracking and event annotations — to improve coarse agent localization, then reconstructs full trajectories using a diffusion-based generative model that produces realistic, temporally coherent motion. Compared to state-of-the-art methods, our approach substantially improves both coarse and fine-grained accuracy while scaling effectively to industrial settings. By integrating human supervision with multimodal generative modeling, we provide a robust and practical solution for producing accurate and realistic player and ball trajectories under challenging real-world single-view conditions.
Citations: 0
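A diffusion model that reconstructs full trajectories from coarse, occlusion-ridden estimates is typically run as a reverse denoising loop conditioned on the coarse input. The sketch below shows that generic DDPM-style sampling pattern; the denoiser `eps_model`, the noise schedule, and the conditioning format are all assumptions, not the paper's architecture.

```python
# Generic sketch of diffusion-based trajectory refinement, assuming a trained
# denoiser eps_model(noisy_traj, t, conditioning) exists (hypothetical).
import torch

T_STEPS = 50
betas = torch.linspace(1e-4, 2e-2, T_STEPS)      # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


@torch.no_grad()
def refine_trajectories(eps_model, coarse_traj, context):
    """coarse_traj: (agents, frames, 2) rough positions used as conditioning.
    Starts from noise and denoises toward trajectories consistent with it."""
    x = torch.randn_like(coarse_traj)
    for t in reversed(range(T_STEPS)):
        cond = torch.cat([coarse_traj, context], dim=-1)   # assumed conditioning
        eps = eps_model(x, t, cond)                        # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```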
Boundary-aware semantic segmentation for ice hockey rink registration
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-27 | DOI: 10.1016/j.cviu.2025.104627
Zhibo Wang, Amir Nazemi, Stephie Liu, Sirisha Rambhatla, Yuhao Chen, David Clausi
Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of +2.84 and +3.48 in IoU_part and IoU_whole on the NHL dataset, and +1.53 and +5.85 on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.
Citations: 0
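Treating the rink boundary as an explicit, up-weighted segmentation class can be expressed as a weighted cross-entropy. The snippet below is a generic sketch of that idea; the class layout, boundary index, and weight value are assumptions, and the paper's boundary-aware loss may differ in form.

```python
# Sketch of a segmentation loss that adds the rink boundary as its own class
# and up-weights it. Class indices and the weight factor are assumed values.
import torch
import torch.nn.functional as F

NUM_CLASSES = 5          # e.g. background, ice, lines, circles, boundary (assumed)
BOUNDARY_CLASS = 4       # assumed index of the added boundary class
BOUNDARY_WEIGHT = 4.0    # assumed up-weighting factor


def boundary_aware_ce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with extra penalty on boundary pixels.

    logits: (B, C, H, W) raw scores; target: (B, H, W) class indices.
    """
    weights = torch.ones(NUM_CLASSES, device=logits.device)
    weights[BOUNDARY_CLASS] = BOUNDARY_WEIGHT
    return F.cross_entropy(logits, target, weight=weights)


# Example usage with random tensors
logits = torch.randn(2, NUM_CLASSES, 64, 64)
target = torch.randint(0, NUM_CLASSES, (2, 64, 64))
loss = boundary_aware_ce(logits, target)
```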
Memory-enriched thought-by-thought framework for complex Diagram Question Answering
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-26 | DOI: 10.1016/j.cviu.2025.104608
Xinyu Zhang, Lingling Zhang, Yanrui Wu, Shaowei Wang, Wenjun Wu, Muye Huang, Qianying Wang, Jun Liu
Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step, lacking the capture and representation of key information during the reasoning process, ignoring critical changes in the reasoning process, and failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop the METbT-Bert, METbT-T5, METbT-Qwen and METbT-InternLM, leveraging Bert, T5, Qwen and InternLM as the foundational language models with our framework, respectively. Experiments are conducted on multiple datasets including Smart-101, ScienceQA, and IconQA, and in all cases, the results surpassed those of the same language models. The results demonstrate that our METbT framework offers superior scalability and robustness.
Citations: 0
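The abstract describes two ingredients: a memory that stores intermediate reasoning representations and operators that merge visual and textual features. The toy loop below illustrates how a step-by-step, memory-backed fusion might look; the operator set, dimensions, and gating are assumptions, not the METbT modules.

```python
# Toy sketch of "memory + operators": each reasoning step fuses the current
# state with a textual thought via an operator and appends the result to a
# memory, instead of fusing everything in a single step. Shapes are assumed.
import torch
import torch.nn as nn


class FusionOperators(nn.Module):
    """Two toy operators for merging visual and textual features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def add(self, v, t):
        return v + t

    def gated(self, v, t):
        g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))
        return g * v + (1 - g) * t


def reason(visual, text_steps, ops: FusionOperators):
    """Thought-by-thought loop: keep every intermediate fused state in memory."""
    memory = []
    state = visual
    for t_emb in text_steps:
        state = ops.gated(state, t_emb)   # fuse the current thought with the state
        memory.append(state)              # store the intermediate representation
    return memory[-1], memory


dim = 256
ops = FusionOperators(dim)
final_state, mem = reason(torch.randn(dim), [torch.randn(dim) for _ in range(3)], ops)
```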
SASTD: Stepwise attention style transfer network based on diffusion models
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104612
Zhuoya Wang, Gui Chen, Yaxin Li, Yongsheng Dong
Image style transfer techniques have significantly advanced, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes the problem of information entanglement within the generated stylized result. To alleviate this issue, in this paper we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the initial latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD achieves better performance in both qualitative and quantitative comparisons compared to both image style transfer methods and style-guided text-to-image synthesis methods.
Citations: 0
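The noise-initialization module is described as AdaIN-based, i.e. the initial latent inherits the channel-wise statistics of a style reference. Below is the standard AdaIN computation applied to latents as a sketch of that step; the tensor shapes and the way it plugs into the diffusion pipeline are assumptions.

```python
# Standard AdaIN applied to latent tensors, as a sketch of statistics-aligned
# noise initialization before diffusion sampling. Shapes are assumed.
import torch


def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5):
    """content, style: (B, C, H, W). Returns content re-normalized to carry
    the per-channel mean/std of the style tensor."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean


# Initialize the latent noise from content noise aligned to style statistics
content_latent = torch.randn(1, 4, 64, 64)
style_latent = torch.randn(1, 4, 64, 64)
init_latent = adain(content_latent, style_latent)
```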
RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-23 | DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman, Letizia Gionfrida, Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
Citations: 0
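Estimating tread depth and riser height from a stair point cloud generally reduces to grouping points into horizontal levels and measuring the gaps and extents between them. The sketch below shows one such baseline; the coordinate convention, clustering threshold, and simple level-splitting heuristic are assumptions and not the algorithm evaluated in the paper.

```python
# Baseline sketch: estimate riser height and tread depth by clustering stair
# points into height levels. Cloud format and thresholds are assumed.
import numpy as np


def stair_dimensions(points: np.ndarray, level_gap: float = 0.08):
    """points: (N, 3) array with x = forward, y = lateral, z = up (metres).

    Groups points into horizontal tread levels, then reads riser height as the
    mean gap between consecutive levels and tread depth as the mean forward
    extent of each level.
    """
    z = np.sort(points[:, 2])
    # Split sorted heights wherever the jump exceeds the assumed level gap
    splits = np.where(np.diff(z) > level_gap)[0] + 1
    levels = np.split(z, splits)
    level_heights = np.array([lvl.mean() for lvl in levels])

    tread_depths = []
    for h in level_heights:
        mask = np.abs(points[:, 2] - h) < level_gap / 2
        if mask.any():
            x = points[mask, 0]
            tread_depths.append(x.max() - x.min())

    riser = float(np.diff(level_heights).mean()) if len(level_heights) > 1 else 0.0
    tread = float(np.mean(tread_depths)) if tread_depths else 0.0
    return riser, tread
```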
SynTaskNet: A synergistic multi-task network for joint segmentation and classification of small anatomical structures in ultrasound imaging
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104616
Abdulrhman H. Al-Jebrni, Saba Ghazanfar Ali, Bin Sheng, Huating Li, Xiao Lin, Ping Li, Younhyun Jung, Jinman Kim, Li Xu, Lixin Jiang, Jing Du
Segmenting small, low-contrast anatomical structures and classifying their pathological status in ultrasound (US) images remain challenging tasks in computer vision, especially under the noise and ambiguity inherent in real-world clinical data. Papillary thyroid microcarcinoma (PTMC), characterized by nodules ≤ 1.0 cm, exemplifies these challenges where both precise segmentation and accurate lymph node metastasis (LNM) prediction are essential for informed clinical decisions. We propose SynTaskNet, a synergistic multi-task learning (MTL) architecture that jointly performs PTMC nodule segmentation and LNM classification from US images. Built upon a DenseNet201 backbone, SynTaskNet incorporates several specialized modules: a Coordinated Depth-wise Convolution (CDC) layer for enhancing spatial features, an Adaptive Context Block (ACB) for embedding contextual dependencies, and a Multi-scale Contextual Boundary Attention (MCBA) module to improve boundary localization in low-contrast regions. To strengthen task interaction, we introduce a Selective Enhancement Fusion (SEF) mechanism that hierarchically integrates features across three semantic levels, enabling effective information exchange between segmentation and classification branches. On top of this, we formulate a synergistic learning scheme wherein an Auxiliary Segmentation Map (ASM) generated by the segmentation decoder is injected into SEF’s third class-specific fusion path to guide LNM classification. In parallel, the predicted LNM label is concatenated with the third-path SEF output to refine the Final Segmentation Map (FSM), enabling bidirectional task reinforcement. Extensive evaluations on a dedicated PTMC US dataset demonstrate that SynTaskNet achieves state-of-the-art performance, with a Dice score of 93.0% for segmentation and a classification accuracy of 94.2% for LNM prediction, validating its clinical relevance and technical efficacy.
Citations: 0
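The bidirectional coupling described above (an auxiliary segmentation map conditioning the classifier, and the predicted label conditioning the final segmentation head) can be illustrated with a tiny two-head network. Layer sizes, pooling, and the label broadcast are assumptions; this is not the SynTaskNet architecture with its CDC/ACB/MCBA/SEF modules.

```python
# Toy two-head network showing segmentation->classification->segmentation
# feedback. All layer sizes and the fusion scheme are assumed.
import torch
import torch.nn as nn


class TinyMultiTask(nn.Module):
    def __init__(self, feat_ch: int = 32, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Conv2d(1, feat_ch, 3, padding=1)
        self.aux_seg = nn.Conv2d(feat_ch, 1, 1)                  # auxiliary map (ASM-like)
        self.cls_head = nn.Linear(feat_ch + 1, num_classes)      # classifier sees the map
        self.final_seg = nn.Conv2d(feat_ch + num_classes, 1, 1)  # final map sees the label

    def forward(self, x):
        feat = torch.relu(self.backbone(x))                       # (B, C, H, W)
        asm = torch.sigmoid(self.aux_seg(feat))                   # (B, 1, H, W)
        pooled = torch.cat([feat, asm], dim=1).mean(dim=(2, 3))   # (B, C+1)
        logits = self.cls_head(pooled)                             # (B, K)
        label_map = logits.softmax(-1)[..., None, None].expand(
            -1, -1, feat.shape[2], feat.shape[3])                  # broadcast label over space
        fsm = torch.sigmoid(self.final_seg(torch.cat([feat, label_map], dim=1)))
        return asm, logits, fsm


model = TinyMultiTask()
asm, logits, fsm = model(torch.randn(2, 1, 64, 64))
```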
Label-informed knowledge integration: Advancing visual prompt for VLMs adaptation
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104614
Yue Wu, Yunhong Wang, Guodong Wang, Jinjin Zhang, Yingjie Gao, Xiuguo Bao, Di Huang
Prompt tuning has emerged as a pivotal technique for adapting pre-trained vision-language models (VLMs) to a wide range of downstream tasks. Recent developments have introduced multimodal learnable prompts to construct task-specific classifiers. However, these methods often exhibit limited generalization to unseen classes, primarily due to fixed prompt designs that are tightly coupled with seen training data and lack adaptability to novel class distributions. To overcome this limitation, we propose Label-Informed Knowledge Integration (LIKI)—a novel framework that harnesses the robust generalizability of textual label semantics to guide the generation of adaptive visual prompts. Rather than directly mapping textual prompts into the visual domain, LIKI utilizes robust text embeddings as a knowledge source to inform the visual prompt optimization. Central to our method is a simple yet effective Label Semantic Integration (LSI) module, which dynamically incorporates knowledge from both seen and unseen labels into the visual prompts. This label-informed prompting strategy imbues the visual encoder with semantic awareness, thereby enhancing the generalization and discriminative capacity of VLMs across diverse scenarios. Extensive experiments demonstrate that LIKI consistently outperforms state-of-the-art approaches in base-to-novel generalization, cross-dataset transfer, and domain generalization tasks, offering a significant advancement in prompt-based VLM adaptation.
Citations: 0
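The key idea is that frozen text embeddings of class labels act as a knowledge source that shapes learnable visual prompts. A rough sketch using cross-attention from prompt tokens to label embeddings is shown below; the dimensions, number of prompts, and attention setup are assumptions rather than the LSI module itself.

```python
# Sketch of label-informed visual prompting: learnable prompt tokens attend
# over frozen label text embeddings, then get prepended to the patch tokens.
# Dimensions and module layout are assumed.
import torch
import torch.nn as nn


class LabelInformedPrompt(nn.Module):
    def __init__(self, dim: int = 512, n_prompts: int = 4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, label_embs: torch.Tensor, patch_tokens: torch.Tensor):
        """label_embs: (B, L, D) frozen text embeddings of seen and unseen labels.
        patch_tokens: (B, N, D) visual tokens from the image encoder."""
        q = self.prompts.unsqueeze(0).expand(label_embs.size(0), -1, -1)
        informed, _ = self.attn(q, label_embs, label_embs)   # inject label semantics
        return torch.cat([informed, patch_tokens], dim=1)    # prepend informed prompts


module = LabelInformedPrompt()
tokens = module(torch.randn(2, 10, 512), torch.randn(2, 196, 512))
```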
SCAFNet: Multimodal stroke medical image synthesis and fusion network based on self attention and cross attention
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104611
Yu Zhu, Liqiang Song, Junli Zhao, Guodong Wang, Hui Li, Yi Li
Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning(DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to improve diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.
Citations: 0
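The fusion pattern in the abstract (self-attention within each modality, then cross-attention across modalities) is sketched below with standard attention layers; the block layout, residual wiring, and sizes are assumptions and omit the IAM and the pix2pixHD synthesis stage.

```python
# Sketch of self-attention + cross-attention fusion for two modalities.
# Layer sizes and residual connections are assumed, not the SCAFNet design.
import torch
import torch.nn as nn


class SelfCrossFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        """feat_a, feat_b: (B, N, D) token features from two imaging modalities."""
        a, _ = self.self_a(feat_a, feat_a, feat_a)   # intra-modal emphasis
        b, _ = self.self_b(feat_b, feat_b, feat_b)
        fused, _ = self.cross(a, b, b)               # modality A queries modality B
        return fused + a                              # residual fusion


fusion = SelfCrossFusion()
out = fusion(torch.randn(1, 64, 256), torch.randn(1, 64, 256))
```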
A dynamic hybrid network with attention and mamba for image captioning
IF 3.5 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.cviu.2025.104617
Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li
Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements, nevertheless, play a critical role in yielding captions of superior quality. Therefore, we pursue strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates the attention mechanisms with the mamba blocks, which further complements the attention by leveraging mamba’s superior long-sequence modeling capabilities and enables a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via https://github.com/simple-boy/DH-Net.
Citations: 0
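The Feature Aggregation Module is described as dynamically adapting multimodal fusion to the evolving decoding context. A minimal stand-in for that behavior is a decoder-state-conditioned gate over feature sources, as sketched below; the name reuse, shapes, and two-source setup are assumptions, not the published FAM.

```python
# Stand-in for dynamic feature aggregation: the current decoder state produces
# softmax weights that mix pooled grid features and global features. Assumed shapes.
import torch
import torch.nn as nn


class DynamicAggregation(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, 2)   # one weight per feature source

    def forward(self, decoder_state, grid_feat, global_feat):
        """decoder_state: (B, D); grid_feat, global_feat: (B, D) pooled features."""
        w = torch.softmax(self.gate(decoder_state), dim=-1)       # (B, 2)
        return w[:, :1] * grid_feat + w[:, 1:] * global_feat      # context-aware mix


fam = DynamicAggregation()
mixed = fam(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```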