Image and Vision Computing: Latest Publications

Enhancing UAV small target detection: A balanced accuracy-efficiency algorithm with tiered feature focus
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-10 | DOI: 10.1016/j.imavis.2026.105897
Hanwei Guo, Shugang Liu
Small target detection in unmanned aerial vehicle (UAV) imagery is crucial for both military and civilian applications. However, achieving a balance between detection performance, efficiency, and lightweight architecture remains challenging. This paper introduces TF-DEIM-DFINE, a tiered-focus small target detection model designed specifically for UAV tasks. We propose the Convolutional Gated-Visual Mamba (CG-VIM) module to enhance global dependency capture and local detail extraction through long sequence modeling, along with the Half-Channel Single-Head Attention (HCSA) module for global modeling, which improves fine-grained representation while reducing computational redundancy. Additionally, our Tiered Focus-Feature Pyramid Networks (TF-FPN) improve the representational capability of high-frequency information in multi-scale features without significantly increasing computational overhead. Experimental results on the VisDrone dataset demonstrate a 4.7% improvement in AP_M and a 5.8% improvement in AP, with a 37% reduction in parameter count and only a 6% increase in GFLOPs, while FPS remains unchanged. These results highlight TF-DEIM-DFINE's ability to improve detection accuracy while preserving a lightweight and efficient structure.
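The abstract does not give implementation details, but to make the HCSA idea concrete, here is a minimal PyTorch-style sketch of one plausible half-channel, single-head attention block; the class name, split ratio, and projection layout are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class HalfChannelSingleHeadAttention(nn.Module):
    """Hypothetical sketch of an HCSA-style block: single-head attention is applied to
    only half of the channels while the other half passes through unchanged, which is
    one way to reduce computational redundancy."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "channel count must be even to split in half"
        half = channels // 2
        self.qkv = nn.Linear(half, half * 3, bias=False)  # joint Q/K/V projection
        self.proj = nn.Linear(half, half, bias=False)     # output projection
        self.scale = half ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels); split channels into attended / identity halves
        attn_in, skip = x.chunk(2, dim=-1)
        q, k, v = self.qkv(attn_in).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = self.proj(attn @ v)
        return torch.cat([out, skip], dim=-1)  # recombine with the untouched half

if __name__ == "__main__":
    x = torch.randn(2, 196, 64)  # e.g. a 14x14 feature map flattened to tokens
    print(HalfChannelSingleHeadAttention(64)(x).shape)  # torch.Size([2, 196, 64])
```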
Citations: 0
OIDSty: One-shot identity-preserving face stylization
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-07 | DOI: 10.1016/j.imavis.2026.105899
Kairui Wang, Xinying Liu, Di Zhao, Xuelei Geng, Tian Xian, Yonghao Chang
In recent years, image generation techniques based on diffusion models have made significant progress in the field of facial stylization. However, existing methods still face challenges in achieving high identity fidelity while maintaining strong stylistic expressiveness, particularly in balancing the geometric deformations introduced by stylization with the preservation of fine facial details (such as facial features and poses). To address this issue, this paper proposes a novel single-sample facial stylization system, OIDSty. Its core innovation lies in decoupling the identity-preservation and style-injection tasks across distinct attention layers, primarily through two key designs: (1) the High-Fidelity Identity Module, which combines strong semantic conditions and weak spatial conditions to guide the cross-attention layers, enabling precise retention of core identity and facial layout features while permitting stylized geometric deformations; and (2) the DINO-Style Texture Guidance Module, which introduces a texture-guidance loss into the self-attention layers to compute the feature difference between the ideal stylized output and the current output. This loss is integrated into the denoising sampling process, dynamically calibrating latent features through gradients to ensure efficient and accurate transfer of stylized textures onto the target image. Extensive experimental results demonstrate that OIDSty generates high-fidelity, stylistically distinct images across multiple styles. Compared to existing state-of-the-art methods, our method exhibits significant advantages across all objective and subjective evaluation metrics without requiring complex parameter tuning.
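As a rough illustration of the gradient-based latent calibration the abstract describes for the texture guidance loss, the following sketch nudges a diffusion latent along the negative gradient of a feature-difference loss; `feature_extractor`, `style_target_feat`, and the step size are hypothetical placeholders, not the paper's actual components.

```python
import torch
import torch.nn.functional as F

def calibrate_latent(latent, feature_extractor, style_target_feat, step_size=0.1):
    """Hypothetical sketch of gradient-based latent calibration: measure the feature
    difference between the current estimate and an ideal stylized target, then move
    the latent a small step along the negative gradient before the next denoising step.
    feature_extractor and style_target_feat are assumed, differentiable inputs."""
    latent = latent.detach().requires_grad_(True)
    current_feat = feature_extractor(latent)            # features of the current estimate
    loss = F.mse_loss(current_feat, style_target_feat)  # feature difference to the target
    grad, = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()         # calibrated latent for the next step
```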
Citations: 0
LoRA-empowered efficient diffusion for accurate fine-grained detail rendering in real-image cartoonization
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-06 | DOI: 10.1016/j.imavis.2026.105898
Mingjin Liu, Yien Li
Recent advances in generative models have enabled diverse applications, from text-to-image synthesis to artistic content creation. However, generating high-quality, domain-specific content — particularly for culturally unique styles like Chinese opera — remains challenging due to limited generalization on long-tail data and the high cost of fine-tuning with specialized datasets. To address these limitations, we propose DreamOpera, a novel framework for transforming real-world Chinese opera character photographs into stylized cartoon representations. Our approach leverages a two-step process: (1) feature extraction using a pre-trained encoder to capture key visual attributes (e.g., clothing, facial features), and (2) domain transformation via a LoRA-fine-tuned diffusion model trained on a small, unpaired dataset of cartoon-style opera images. This strategy bypasses the need for costly paired data while preserving fine-grained details. Experiments demonstrate that DreamOpera outperforms existing methods in generating high-fidelity, culturally nuanced artwork, offering practical value for cultural dissemination and digital art.
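The framework relies on LoRA fine-tuning of the diffusion model; as a reminder of how LoRA keeps adaptation cheap on a small unpaired dataset, here is a generic, self-contained PyTorch sketch of a LoRA-wrapped linear layer. The rank, scaling, and class name are illustrative assumptions, not the DreamOpera implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: a frozen base projection plus a trainable low-rank update,
    scaled by alpha / rank.  Only the small A and B matrices are updated during
    fine-tuning, so the pretrained weights stay fixed."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)                # start as an identity update
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768))
    print(layer(torch.randn(1, 768)).shape)  # torch.Size([1, 768])
```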
Citations: 0
Long-FAS: Cross-domain face anti-spoofing with long text guidance
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-06 | DOI: 10.1016/j.imavis.2026.105901
Jianwen Zhang, Jianfeng Zhang, Dedong Yang, Rongtao Li, Ziyang Li
Recent studies have demonstrated that utilizing natural language as a supervisory signal can enhance face anti-spoofing (FAS) performance; however, these methods still fall short in fully addressing long-text inputs and fine-grained information. To mitigate these limitations, we leverage MiniGPT-4 to generate detailed long-form textual descriptions of facial features for input images, and propose a novel framework, Long-FAS, which extracts textual and visual information through a dual-branch architecture. Specifically, we incorporate positional encoding for knowledge retention to enable the learning of effective feature representations from long texts, and employ principal component analysis (PCA) matching to capture essential attribute information while prioritizing critical attributes. Furthermore, matching visual and textual features at both coarse and fine granularities enhances the model’s ability to effectively handle both long and short texts, thereby empowering it to learn robust discriminative cues from facial images. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art counterparts.
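To illustrate the PCA matching step in a generic way, the sketch below projects a batch of text-feature vectors onto their top-k principal components; the function name and the plain SVD-based projection are assumptions, not the authors' exact procedure.

```python
import torch

def pca_project(features: torch.Tensor, k: int) -> torch.Tensor:
    """Hypothetical sketch of the PCA step: project (batch, dim) feature vectors onto
    their top-k principal components so that matching focuses on the most informative
    attribute directions.  Returns a (batch, k) tensor."""
    centered = features - features.mean(dim=0, keepdim=True)
    # torch.linalg.svd returns U, S, Vh; the rows of Vh are the principal directions
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:k].T
```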
Citations: 0
Distributed quantum model learning for traffic density estimation
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-06 | DOI: 10.1016/j.imavis.2026.105900
Kewen Wang, Bin Wang, Wenzhe Zhai, Jing-an Cheng
In Intelligent Autonomous Transport Systems (IATS), the integration of lightweight machine learning techniques enables the deployment of real-time and efficient AI models on edge devices. A fundamental task is traffic density estimation, which is crucial for efficient intelligent traffic control. The rapid progress in deep neural networks (DNNs) has led to a notable improvement in the accuracy of traffic density estimation. However, two main issues remain unsolved. First, current DNN models involve numerous parameters and consume large amounts of computing resources, and their performance degrades when detecting multi-scale vehicle targets. Second, growing privacy concerns have made individuals increasingly unwilling to share their data for model training, which leads to data isolation challenges. To address these problems, we introduce the Distributed Quantum Model Learning (DQML) model for traffic density estimation. It incorporates an Efficient Quantum-driven Adaptive (EQA) module that captures multi-scale information using quantum states. In addition, we propose a distributed learning strategy that trains multiple client models on local data and aggregates them via a global parameter server. This strategy ensures privacy protection while offering a significant improvement in estimation performance compared to models trained on limited and isolated data. We evaluated the proposed model on six key benchmarks for vehicle and crowd density analysis, and comprehensive experiments demonstrated that it surpasses other state-of-the-art models in both accuracy and efficiency.
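The abstract describes local client training with aggregation at a global parameter server but does not specify the aggregation rule; assuming a federated-averaging-style update, the server-side step could look like the following sketch (all names are illustrative).

```python
import copy
import torch
import torch.nn as nn

def aggregate_client_models(global_model: nn.Module, client_models, client_sizes):
    """Hypothetical sketch of the parameter-server step under a FedAvg-style assumption:
    average client weights, weighted by local dataset size, so raw data never leaves
    the clients.  Returns a new state dict to load into the global model."""
    total = float(sum(client_sizes))
    new_state = copy.deepcopy(global_model.state_dict())
    for key, value in new_state.items():
        if value.is_floating_point():  # average only floating-point parameters/buffers
            new_state[key] = sum(
                (n / total) * cm.state_dict()[key]
                for cm, n in zip(client_models, client_sizes)
            )
    return new_state  # apply with global_model.load_state_dict(new_state)
```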
Citations: 0
Integrating spatial features and dynamically learned temporal features via contrastive learning for video temporal grounding in LLM
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-01-05 | DOI: 10.1016/j.imavis.2026.105895
Peifu Wang, Yixiong Liang, Yigang Cen, Lihui Cen, Zhe Qu, Jingling Liu, Shichao Kan
Video temporal grounding (VTG) is crucial for fine-grained temporal understanding in vision-language tasks. While large vision-language models (LVLMs) have shown promising results through image–text alignment and video-instruction tuning, they represent videos as static sequences of sampled frames processed by image-based vision encoders, inherently limiting their capacity to capture dynamic and sequential information effectively and leading to suboptimal performance. To address this, we propose integrating spatial features with dynamically learned temporal features using contrastive learning. Temporal features are dynamically extracted by learning a set of temporal query tokens, which prompt temporal feature extraction via contrastive alignment between video sequences and their corresponding descriptions. Moreover, LLM-based VTG models are typically supervised solely through the language modeling loss, which is insufficient for effectively guiding such tasks. Thus, the VTG model in our method is trained with a temporal localization loss that combines mean squared error (MSE), the intersection-over-union (IoU) of the temporal range, and the cosine similarity of temporal embeddings, and is designed to be applicable to large language models. Our experiments on benchmark datasets demonstrate the effectiveness of the proposed method.
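Assuming the three named terms are simply weighted and summed (the abstract does not specify the exact combination), a minimal sketch of such a temporal localization loss might look like this; the weights and the (start, end) span format are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_localization_loss(pred_span, gt_span, pred_emb, gt_emb,
                               w_mse=1.0, w_iou=1.0, w_cos=1.0):
    """Hypothetical sketch combining the three terms named in the abstract:
    MSE on (start, end) spans, 1 - temporal IoU, and 1 - cosine similarity of
    temporal embeddings.  Spans have shape (batch, 2); embeddings (batch, dim)."""
    mse = F.mse_loss(pred_span, gt_span)
    # temporal IoU between 1-D spans given as (start, end)
    inter = (torch.min(pred_span[:, 1], gt_span[:, 1])
             - torch.max(pred_span[:, 0], gt_span[:, 0])).clamp(min=0)
    union = (pred_span[:, 1] - pred_span[:, 0]) + (gt_span[:, 1] - gt_span[:, 0]) - inter
    iou = inter / union.clamp(min=1e-6)
    cos = F.cosine_similarity(pred_emb, gt_emb, dim=-1)
    return w_mse * mse + w_iou * (1 - iou).mean() + w_cos * (1 - cos).mean()
```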
Citations: 0