
Latest Articles in Computer Vision and Image Understanding

Extending Large Language Models to multimodality for non-English languages
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-30. DOI: 10.1016/j.cviu.2025.104618
Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
The growing popularity of Large Vision-Language Models has highlighted and intensified one of the most well-known challenges in the field of Large Language Models: training is mainly, and most of the time exclusively, conducted on English data. Consequently, the resulting models are more prone to error in non-English tasks, and this issue is exacerbated in multimodal settings that are even more complex and use task-specific datasets. Given this, research on Large Language Models has turned toward adapting them to non-English languages. However, the scarcity of open and curated resources for these languages poses a significant limitation. In this work, we aim to tackle the aforementioned challenge by exploring the adaptation of Large Vision-Language Models to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how the evaluation of the results is influenced when training a vision-to-text adapter across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research.
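The abstract does not specify the adapter architecture; as a point of reference, a minimal sketch of the common vision-to-text adapter pattern (a learned projection mapping frozen vision-encoder features into the LLM embedding space) is shown below. The dimensions and the two-layer MLP are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VisionToTextAdapter(nn.Module):
    """Minimal sketch: project frozen vision features into the LLM token-embedding space.

    The dimensions (1024 visual, 4096 textual) and the two-layer MLP are assumptions
    for illustration, not the architecture used in the paper.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen image encoder.
        # The projected tokens would be concatenated with the (machine-translated)
        # text prompt embeddings before being fed to the frozen LLM.
        return self.proj(vision_feats)

# Example: 16 image patches projected into pseudo text tokens.
adapter = VisionToTextAdapter()
visual_tokens = adapter(torch.randn(2, 16, 1024))
print(visual_tokens.shape)  # torch.Size([2, 16, 4096])
```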
Citations: 0
SpinVision: An end-to-end volleyball spin estimation with Siamese-based deep classification
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-29. DOI: 10.1016/j.cviu.2025.104628
Shreya Bansal, Anterpreet Kaur Bedi, Pratibha Kumari, Rishi Kumar Soni, Narayanan C. Krishnan, Mukesh Saini
Accurate spin estimation is crucial for characterizing ball dynamics, conducting training, and analyzing performance in sports such as volleyball. Traditional methods usually rely on geometric assumptions, handcrafted features, or marker-based estimation, which limits their adaptability to real-world problems. In this paper, we propose a novel spin estimation framework, SpinVision, which treats spin estimation as a soft-classification problem. The deep learning model employs Gaussian soft labels and a Kullback–Leibler Divergence (KLD) loss. Further, it employs fusion methods alongside squeeze-and-excitation blocks and residual connections, which help achieve distinctive representations without the support of external markers or registration procedures. The inclusion of transfer learning also helps the model generalize effectively to real-world problems, such as estimating the spin of a volleyball. Compared with hard-classification or regression-based methods, the proposed model produces more reliable and smoother predictions, highlighting it as a more accurate, robust, and practical solution for spin prediction in sports analytics and related applications.
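As an illustration of the soft-classification formulation described above, continuous spin values can be encoded as Gaussian soft labels over discrete bins and trained with a KL-divergence loss. The bin range, bin count, and sigma below are assumptions for the sketch; the paper's exact parametrization is not given in the abstract.

```python
import torch
import torch.nn.functional as F

def gaussian_soft_labels(spin_rps: torch.Tensor, bin_centers: torch.Tensor,
                         sigma: float = 0.5) -> torch.Tensor:
    """Encode continuous spin rates (revolutions per second) as Gaussian soft labels.

    Illustrative parametrization; not taken from the paper.
    """
    # (batch, num_bins) Gaussian centred on the true spin value, normalized to sum to 1.
    dist = -(spin_rps[:, None] - bin_centers[None, :]) ** 2 / (2 * sigma ** 2)
    return torch.softmax(dist, dim=1)

bin_centers = torch.linspace(0.0, 10.0, steps=21)     # hypothetical 0-10 rps range
targets = gaussian_soft_labels(torch.tensor([2.3, 7.8]), bin_centers)

logits = torch.randn(2, 21)                           # network output for 2 clips
log_probs = F.log_softmax(logits, dim=1)
kld_loss = F.kl_div(log_probs, targets, reduction="batchmean")
print(kld_loss)
```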
Citations: 0
TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-29. DOI: 10.1016/j.cviu.2025.104613
Leonardo Plini, Luca Scofano, Edoardo De Matteis, Guido Maria D’Amely di Melendugno, Alessandro Flaborea, Andrea Sanchietti, Giovanni Maria Farinella, Fabio Galasso, Antonino Furnari
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.
Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
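The mismatch rule described above reduces to a simple set-membership test once both branches emit action tokens. A minimal sketch follows; the action-token vocabulary and the top-k anticipation set are hypothetical simplifications, not the paper's exact interface.

```python
from typing import Sequence

def detect_mistake(recognized_action: str, anticipated_actions: Sequence[str]) -> bool:
    """Flag a procedural mistake when the action recognized from the current video
    segment is not among the actions the anticipation branch predicted as plausible
    continuations of the sequence so far."""
    return recognized_action not in anticipated_actions

# Hypothetical action tokens for an assembly procedure.
history = ["take_board", "attach_leg_1", "attach_leg_2"]
anticipated = ["attach_leg_3", "attach_leg_4"]   # e.g. parsed from an LLM completion
recognized = "flip_board"                        # output of the recognition branch

print(detect_mistake(recognized, anticipated))   # True -> report an online mistake
```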
Citations: 0
Multimodal transformer–diffusion framework for large-scale reconstruction of soccer tracking data
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-29. DOI: 10.1016/j.cviu.2025.104626
Harry Hughes, Patrick Lucey, Michael Horton, Harshala Gammulle, Clinton Fookes, Sridha Sridharan
In soccer, tracking data (player and ball locations over time) is central to performance analysis and a major focus of computer vision in sport. Tracking from broadcast or single-view video offers scalable coverage across all professional matches but suffers from frequent occlusions and missing information. Existing academic work typically evaluates short clips under simplified conditions, whereas industrial applications require complete, game-level coverage. We address these challenges with a multimodal transformer–diffusion framework that combines human-in-the-loop event supervision with single-view video. Our approach first leverages long-term multimodal context — tracking and event annotations — to improve coarse agent localization, then reconstructs full trajectories using a diffusion-based generative model that produces realistic, temporally coherent motion. Compared to state-of-the-art methods, our approach substantially improves both coarse and fine-grained accuracy while scaling effectively to industrial settings. By integrating human supervision with multimodal generative modeling, we provide a robust and practical solution for producing accurate and realistic player and ball trajectories under challenging real-world single-view conditions.
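The abstract describes a diffusion model that reconstructs full trajectories conditioned on coarse agent localization. As a point of reference only, a standard DDPM-style reverse (sampling) loop for trajectory tensors is sketched below, with a placeholder noise-prediction network and hypothetical shapes (22 players plus ball); the paper's transformer architecture, conditioning scheme, and noise schedule are not specified in the abstract.

```python
import torch
import torch.nn as nn

class TrajectoryDenoiser(nn.Module):
    """Placeholder noise-prediction network (the paper's model is not specified here)."""
    def __init__(self, dim: int = 2 * 23):                       # x,y for 22 players + ball (assumption)
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x_t, cond, t):
        # x_t, cond: (batch, frames, dim); t: (batch, 1, 1) normalized timestep.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample_trajectories(model, cond, steps: int = 50):
    """Standard DDPM reverse process, conditioned on coarse agent locations."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(cond)                                   # start from pure noise
    for t in reversed(range(steps)):
        eps = model(x, cond, torch.full((cond.shape[0], 1, 1), t / steps))
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

model = TrajectoryDenoiser()
coarse = torch.randn(1, 100, 46)                                 # (batch, frames, 23 agents * 2 coords)
full = sample_trajectories(model, coarse)
print(full.shape)                                                # torch.Size([1, 100, 46])
```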
Citations: 0
Boundary-aware semantic segmentation for ice hockey rink registration
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-27. DOI: 10.1016/j.cviu.2025.104627
Zhibo Wang, Amir Nazemi, Stephie Liu, Sirisha Rambhatla, Yuhao Chen, David Clausi
Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of +2.84 and +3.48 in IoU_part and IoU_whole on the NHL dataset, and +1.53 and +5.85 on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.
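The abstract does not give the form of the boundary-aware loss. One common way to realize the idea, up-weighting the cross-entropy for pixels whose neighbourhood contains more than one class (boundary region extracted via max-pool dilation of the one-hot labels), is sketched below as an assumption, not as the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def boundary_weight_map(labels: torch.Tensor, num_classes: int, kernel: int = 5,
                        boundary_weight: float = 5.0) -> torch.Tensor:
    """Give pixels whose neighbourhood contains more than one class a higher loss weight.

    Generic realization of 'boundary-aware' weighting; illustrative parameters."""
    one_hot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()   # (B, C, H, W)
    pad = kernel // 2
    dilated = F.max_pool2d(one_hot, kernel_size=kernel, stride=1, padding=pad)
    near_boundary = dilated.sum(dim=1) > 1.0                               # >1 class nearby
    weights = torch.ones(labels.shape, dtype=torch.float32)
    weights[near_boundary] = boundary_weight
    return weights

def boundary_aware_ce(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    weights = boundary_weight_map(labels, logits.shape[1])
    per_pixel = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_pixel).mean()

logits = torch.randn(2, 4, 64, 64)            # 4 classes, incl. a rink-boundary class
labels = torch.randint(0, 4, (2, 64, 64))
print(boundary_aware_ce(logits, labels))
```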
Citations: 0
Memory-enriched thought-by-thought framework for complex Diagram Question Answering
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-26. DOI: 10.1016/j.cviu.2025.104608
Xinyu Zhang, Lingling Zhang, Yanrui Wu, Shaowei Wang, Wenjun Wu, Muye Huang, Qianying Wang, Jun Liu
Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step, lacking the capture and representation of key information during the reasoning process, ignoring critical changes in the reasoning process, and failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop the METbT-Bert, METbT-T5, METbT-Qwen and METbT-InternLM, leveraging Bert, T5, Qwen and InternLM as the foundational language models with our framework, respectively. Experiments are conducted on multiple datasets including Smart-101, ScienceQA, and IconQA, and in all cases, the results surpassed those of the same language models. The results demonstrate that our METbT framework offers superior scalability and robustness.
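To make the memory-plus-operators idea concrete, the sketch below shows a step-wise memory that stores intermediate reasoning representations and one possible gated operator for merging visual and textual features. The class names, dimensions, gating design, and mean-pooled read are illustrative assumptions, not the modules defined in the paper.

```python
import torch
import torch.nn as nn

class ReasoningMemory:
    """Minimal sketch of a step-wise memory: store the representation produced at each
    reasoning step so later steps can draw on (here: average over) earlier ones."""

    def __init__(self):
        self.slots = []

    def write(self, step_repr: torch.Tensor) -> None:
        self.slots.append(step_repr)

    def read(self) -> torch.Tensor:
        return torch.stack(self.slots, dim=1).mean(dim=1)   # (batch, dim)

class GatedFusionOperator(nn.Module):
    """One possible 'operator' for merging visual and textual representations."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_repr: torch.Tensor, visual_repr: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([text_repr, visual_repr], dim=-1)))
        return g * text_repr + (1 - g) * visual_repr

memory = ReasoningMemory()
fuse = GatedFusionOperator()
for _ in range(3):                                  # three hypothetical reasoning steps
    step = fuse(torch.randn(2, 768), torch.randn(2, 768))
    memory.write(step)
print(memory.read().shape)                          # torch.Size([2, 768])
```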
Citations: 0
SASTD: Stepwise attention style transfer network based on diffusion models
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-23. DOI: 10.1016/j.cviu.2025.104612
Zhuoya Wang, Gui Chen, Yaxin Li, Yongsheng Dong
Image style transfer techniques have significantly advanced, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes the problem of information entanglement within the generated stylized result. To alleviate this issue, in this paper we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the initial latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD achieves better performance in both qualitative and quantitative comparisons compared to both image style transfer methods and style-guided text-to-image synthesis methods.
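AdaIN itself has a standard definition: align the per-channel mean and standard deviation of content features to those of style features. The sketch below shows that operation and one plausible way to fold it into latent-noise initialization, as the abstract describes at a high level; the latent shapes and the blend factor are assumptions, not values from the paper.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: re-scale content features so that their
    per-channel mean/std match those of the style features."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

# Hypothetical 4-channel, 64x64 diffusion latents of the content and style images.
content_latent = torch.randn(1, 4, 64, 64)
style_latent = torch.randn(1, 4, 64, 64)

# Initial noise for sampling: blend fresh Gaussian noise with AdaIN-aligned content
# statistics, so generation starts with some content/style statistics in place
# (the 0.8/0.2 blend is an illustrative assumption).
init_noise = 0.8 * torch.randn_like(content_latent) + 0.2 * adain(content_latent, style_latent)
print(init_noise.shape)
```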
Citations: 0
RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-23. DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman, Letizia Gionfrida, Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
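The abstract reports accuracies but not the dimension-estimation algorithm's internals. A simplified sketch of one way to estimate riser height and tread depth from a stair point cloud, grouping points into horizontal tread levels by height and differencing consecutive levels, is shown below with synthetic data standing in for the RGB-D point cloud; it is an illustration, not the paper's method.

```python
import numpy as np

def stair_dimensions(points: np.ndarray, height_bin: float = 0.03):
    """Estimate riser height and tread depth from a stair point cloud (meters).

    points: (N, 3) array with x = walking direction, z = up. This histogram-based
    grouping is a simplified illustration, not the paper's exact algorithm."""
    z = points[:, 2]
    bins = np.arange(z.min(), z.max() + height_bin, height_bin)
    counts, edges = np.histogram(z, bins=bins)
    # Treat well-populated height bins as tread surfaces.
    tread_levels = [(edges[i] + edges[i + 1]) / 2
                    for i in np.flatnonzero(counts > counts.max() * 0.3)]
    tread_levels = np.sort(np.array(tread_levels))
    risers = np.diff(tread_levels)
    # Tread depth: extent along x of the points belonging to each tread level.
    depths = []
    for level in tread_levels:
        on_tread = points[np.abs(z - level) < height_bin]
        depths.append(on_tread[:, 0].max() - on_tread[:, 0].min())
    return float(np.mean(risers)), float(np.mean(depths))

# Synthetic three-step staircase: 17 cm risers, 28 cm treads.
steps = [np.column_stack([np.random.uniform(i * 0.28, (i + 1) * 0.28, 500),
                          np.random.uniform(0.0, 1.0, 500),
                          np.full(500, i * 0.17)]) for i in range(3)]
riser, tread = stair_dimensions(np.vstack(steps))
print(round(riser, 3), round(tread, 3))   # roughly 0.17 and 0.28, up to binning error
```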
Citations: 0
SynTaskNet: A synergistic multi-task network for joint segmentation and classification of small anatomical structures in ultrasound imaging
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-18. DOI: 10.1016/j.cviu.2025.104616
Abdulrhman H. Al-Jebrni, Saba Ghazanfar Ali, Bin Sheng, Huating Li, Xiao Lin, Ping Li, Younhyun Jung, Jinman Kim, Li Xu, Lixin Jiang, Jing Du
Segmenting small, low-contrast anatomical structures and classifying their pathological status in ultrasound (US) images remain challenging tasks in computer vision, especially under the noise and ambiguity inherent in real-world clinical data. Papillary thyroid microcarcinoma (PTMC), characterized by nodules ≤ 1.0 cm, exemplifies these challenges where both precise segmentation and accurate lymph node metastasis (LNM) prediction are essential for informed clinical decisions. We propose SynTaskNet, a synergistic multi-task learning (MTL) architecture that jointly performs PTMC nodule segmentation and LNM classification from US images. Built upon a DenseNet201 backbone, SynTaskNet incorporates several specialized modules: a Coordinated Depth-wise Convolution (CDC) layer for enhancing spatial features, an Adaptive Context Block (ACB) for embedding contextual dependencies, and a Multi-scale Contextual Boundary Attention (MCBA) module to improve boundary localization in low-contrast regions. To strengthen task interaction, we introduce a Selective Enhancement Fusion (SEF) mechanism that hierarchically integrates features across three semantic levels, enabling effective information exchange between segmentation and classification branches. On top of this, we formulate a synergistic learning scheme wherein an Auxiliary Segmentation Map (ASM) generated by the segmentation decoder is injected into SEF’s third class-specific fusion path to guide LNM classification. In parallel, the predicted LNM label is concatenated with the third-path SEF output to refine the Final Segmentation Map (FSM), enabling bidirectional task reinforcement. Extensive evaluations on a dedicated PTMC US dataset demonstrate that SynTaskNet achieves state-of-the-art performance, with a Dice score of 93.0% for segmentation and a classification accuracy of 94.2% for LNM prediction, validating its clinical relevance and technical efficacy.
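The abstract describes joint training of a segmentation branch and an LNM classification branch. A standard way to combine such objectives, soft Dice for segmentation plus cross-entropy for classification, is sketched below with an assumed equal weighting; the paper's actual loss terms and weights are not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits: torch.Tensor, masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Binary soft Dice loss for nodule segmentation (logits and masks: (B, 1, H, W))."""
    probs = torch.sigmoid(logits)
    inter = (probs * masks).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + masks.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def multitask_loss(seg_logits, masks, cls_logits, lnm_labels,
                   seg_weight: float = 1.0, cls_weight: float = 1.0) -> torch.Tensor:
    """Joint objective: segmentation Dice + LNM classification cross-entropy.
    The equal weighting is an illustrative choice, not a value from the paper."""
    return seg_weight * soft_dice_loss(seg_logits, masks) + \
           cls_weight * F.cross_entropy(cls_logits, lnm_labels)

seg_logits = torch.randn(4, 1, 128, 128)
masks = (torch.rand(4, 1, 128, 128) > 0.9).float()
cls_logits = torch.randn(4, 2)                     # LNM: negative / positive
lnm_labels = torch.randint(0, 2, (4,))
print(multitask_loss(seg_logits, masks, cls_logits, lnm_labels))
```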
Citations: 0
Label-informed knowledge integration: Advancing visual prompt for VLMs adaptation
IF 3.5. CAS Tier 3 (Computer Science). Q2 in Computer Science, Artificial Intelligence. Pub Date: 2025-12-18. DOI: 10.1016/j.cviu.2025.104614
Yue Wu, Yunhong Wang, Guodong Wang, Jinjin Zhang, Yingjie Gao, Xiuguo Bao, Di Huang
Prompt tuning has emerged as a pivotal technique for adapting pre-trained vision-language models (VLMs) to a wide range of downstream tasks. Recent developments have introduced multimodal learnable prompts to construct task-specific classifiers. However, these methods often exhibit limited generalization to unseen classes, primarily due to fixed prompt designs that are tightly coupled with seen training data and lack adaptability to novel class distributions. To overcome this limitation, we propose Label-Informed Knowledge Integration (LIKI)—a novel framework that harnesses the robust generalizability of textual label semantics to guide the generation of adaptive visual prompts. Rather than directly mapping textual prompts into the visual domain, LIKI utilizes robust text embeddings as a knowledge source to inform the visual prompt optimization. Central to our method is a simple yet effective Label Semantic Integration (LSI) module, which dynamically incorporates knowledge from both seen and unseen labels into the visual prompts. This label-informed prompting strategy imbues the visual encoder with semantic awareness, thereby enhancing the generalization and discriminative capacity of VLMs across diverse scenarios. Extensive experiments demonstrate that LIKI consistently outperforms state-of-the-art approaches in base-to-novel generalization, cross-dataset transfer, and domain generalization tasks, offering a significant advancement in prompt-based VLM adaptation.
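One way to realize "label-informed" visual prompts is to let learnable prompt tokens attend to frozen text embeddings of the class labels, so the prompts fed to the vision encoder carry label-aware information. The sketch below shows that pattern; the cross-attention design, dimensions, and CLIP-style 512-d label embeddings are assumptions for illustration, not the paper's exact LSI module.

```python
import torch
import torch.nn as nn

class LabelSemanticIntegration(nn.Module):
    """Sketch: inject label semantics into visual prompts via cross-attention.

    Learnable visual prompt tokens act as queries over frozen text embeddings of the
    class labels (seen and unseen). Architecture details are illustrative assumptions."""

    def __init__(self, num_prompts: int = 4, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, label_embeddings: torch.Tensor) -> torch.Tensor:
        # label_embeddings: (batch, num_classes, dim), e.g. frozen CLIP text features.
        batch = label_embeddings.shape[0]
        queries = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        prompts, _ = self.cross_attn(queries, label_embeddings, label_embeddings)
        return prompts                      # to be prepended to the vision encoder's tokens

lsi = LabelSemanticIntegration()
label_feats = torch.randn(2, 100, 512)      # 100 class-name embeddings (hypothetical)
print(lsi(label_feats).shape)               # torch.Size([2, 4, 512])
```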
Citations: 0