
Latest publications in IEEE Transactions on Multimedia

Generic-to-Personalised Learning for Multimodal Image Synthesis With Bidirectional Variational GAN
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632663
Long Chen;Xirui Dong;Jiangrong Shen;Lu Zhang;Qi Xu;Gang Pan;Qiang Zhang
Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain; however, unidirectional models heavily rely on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number of paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness on the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patients, enhancing image details. Finally, the GTP learning strategy along with VAS enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.
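As a rough, hedged illustration of the kind of latent-space regularisation a variational structure adds to a generator, the PyTorch sketch below samples the latent by reparameterisation and returns a KL penalty toward a standard normal prior; the module name, shapes, and placement are assumptions for illustration, not the authors' BVGAN implementation.

```python
# Minimal sketch (not the authors' code) of a VAE-style latent regularisation:
# the encoder outputs a mean and log-variance, the latent is sampled by
# reparameterisation, and a KL term keeps the latent space smooth.
import torch
import torch.nn as nn

class VariationalLatent(nn.Module):
    def __init__(self, in_ch: int = 256, z_dim: int = 128):
        super().__init__()
        self.to_mu = nn.Conv2d(in_ch, z_dim, kernel_size=1)
        self.to_logvar = nn.Conv2d(in_ch, z_dim, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        mu, logvar = self.to_mu(feat), self.to_logvar(feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        # KL divergence to a standard normal prior, averaged over the batch
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

feat = torch.randn(2, 256, 16, 16)       # stand-in for generator bottleneck features
z, kl_loss = VariationalLatent()(feat)   # kl_loss would be added to the GAN objective
```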
{"title":"Generic-to-Personalised Learning for Multimodal Image Synthesis With Bidirectional Variational GAN","authors":"Long Chen;Xirui Dong;Jiangrong Shen;Lu Zhang;Qi Xu;Gang Pan;Qiang Zhang","doi":"10.1109/TMM.2025.3632663","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632663","url":null,"abstract":"Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain, however, unidirectional models heavily rely on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness to the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on a large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patient, enhancing image details. Finally, the GTP learning strategy along with VAS enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"902-914"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632652
Fang Peng;Xiaoshan Yang;Yaowei Wang;Changsheng Xu
Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress has been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantically aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.
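As a hedged sketch of how instance-level semantic features are typically consumed in a few-shot setting (prototype matching, not the ProLIPS diffusion pipeline itself), the snippet below averages support-set features into class prototypes and labels a query clip by cosine similarity; all shapes and names are illustrative assumptions.

```python
# Few-shot classification by cosine similarity to class prototypes built from
# per-instance features. This is the generic metric-learning setup only.
import torch
import torch.nn.functional as F

def classify_query(query_feat, support_feats, support_labels, num_classes):
    """query_feat: (D,); support_feats: (N, D); support_labels: (N,)."""
    support_feats = F.normalize(support_feats, dim=-1)
    prototypes = torch.stack([
        support_feats[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])
    prototypes = F.normalize(prototypes, dim=-1)   # (C, D) class prototypes
    query = F.normalize(query_feat, dim=-1)
    scores = prototypes @ query                    # cosine similarities, (C,)
    return scores.argmax().item()

query = torch.randn(512)
support = torch.randn(10, 512)                     # 5-way 2-shot support set
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
print(classify_query(query, support, labels, num_classes=5))
```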
{"title":"Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition","authors":"Fang Peng;Xiaoshan Yang;Yaowei Wang;Changsheng Xu","doi":"10.1109/TMM.2025.3632652","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632652","url":null,"abstract":"Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress have been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantic-aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"853-864"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cas-OVD: Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632649
Zhenyu Fang;Yulong Wu;Jinchang Ren;Jiangbin Zheng;Yijun Yan;Lixiang Zhang
Although text information has helped existing models achieve promising results in open vocabulary object detection (OVD), the lack of semantic information has led to difficulty in small object detection (SOD). Moreover, such a semantic gap also causes failures when matching texts and image features, resulting in false negatives. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce missed and false detections of small objects. Meanwhile, a deformable convolution network based feature conversion module is proposed to enhance the semantic information of small objects, even potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This progressively enhances the feature correlation between the image regions and the textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieves 17.95% AP$_{\mathrm{all}}$ and 14.6% AP$_{\mathrm{s}}$, outperforming RegionCLIP by 3.5% AP$_{\mathrm{all}}$ and 3.0% AP$_{\mathrm{s}}$, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% AP$_{\mathrm{all}}$ and 17.26% AP$_{\mathrm{s}}$, surpassing RegionCLIP by 6.6% AP$_{\mathrm{all}}$ and 6.1% AP$_{\mathrm{s}}$, respectively.
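The sketch below illustrates, under stated assumptions, the general cascade-alignment idea: each stage residually refines the region features and re-computes their similarity to the class text embeddings, so later stages build on earlier matching results. It is a minimal stand-in, not the Cas-OVD architecture.

```python
# Cascaded region/text alignment: refine region features stage by stage and
# re-score them against class text embeddings after each refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeAlign(nn.Module):
    def __init__(self, dim: int = 512, num_stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_stages)
        ])

    def forward(self, region_feats, text_embeds):
        """region_feats: (R, D) proposals; text_embeds: (C, D) class names."""
        sim = None
        for stage in self.stages:
            region_feats = region_feats + stage(region_feats)   # residual refinement
            sim = F.normalize(region_feats, dim=-1) @ F.normalize(text_embeds, dim=-1).T
        return sim                                              # final (R, C) alignment scores

scores = CascadeAlign()(torch.randn(100, 512), torch.randn(20, 512))
print(scores.shape)  # torch.Size([100, 20])
```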
{"title":"Cas-OVD: Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving","authors":"Zhenyu Fang;Yulong Wu;Jinchang Ren;Jiangbin Zheng;Yijun Yan;Lixiang Zhang","doi":"10.1109/TMM.2025.3632649","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632649","url":null,"abstract":"Although text information has aided existing models to achieve promising results in open vocabulary object detection (OVD), the lack of semantic information has led to the difficulty in small objects detection (SOD). Moreover, such semantic gap also causes failure when matching texts and image features, resulting in false negative instances being detected. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce the missing and false detections of small objects. Meanwhile, a deformable convolution network based feature conversion module is proposed to enhance the semantic information of small objects even the potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This can progressively enhance the feature correlation between the image regions and the textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieved 17.95% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 14.6% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, outperforming RegionCLIP by 3.5% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 3.0% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, respectively. On the OV_COCO dataset, Cas-OVD has the 32.71% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 17.26% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, surpassing the RegionCLIP by 6.6% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 6.1% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"757-771"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MDT-FI: Mask-Guided Dual-Branch Transformer With Texture and Structure Feature Interaction for Image Inpainting
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632651
Dong Liu;Xiaofeng Wang;Ruidong Han;Jianghua Li;Shanmin Pang
Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.
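A minimal sketch of mask-guided attention in the spirit described above: keys that fall inside the missing region are suppressed so queries attend only to valid context. The masking scheme and shapes are illustrative assumptions, not the MDT-FI implementation.

```python
# Mask-guided attention: attention scores toward hole tokens are set to -inf
# before the softmax, so the output is built only from valid-region context.
import torch
import torch.nn.functional as F

def mask_guided_attention(q, k, v, hole_mask):
    """q, k, v: (B, N, D); hole_mask: (B, N) with 1 marking missing tokens."""
    d = q.size(-1)
    attn = (q @ k.transpose(-2, -1)) / d ** 0.5                       # (B, N, N)
    attn = attn.masked_fill(hole_mask[:, None, :].bool(), float("-inf"))
    return F.softmax(attn, dim=-1) @ v                                # (B, N, D)

B, N, D = 2, 64, 128
q = k = v = torch.randn(B, N, D)
hole = torch.zeros(B, N)
hole[:, :16] = 1                      # pretend the first 16 tokens lie in the hole
out = mask_guided_attention(q, k, v, hole)
print(out.shape)  # torch.Size([2, 64, 128])
```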
{"title":"MDT-FI: Mask-Guided Dual-Branch Transformer With Texture and Structure Feature Interaction for Image Inpainting","authors":"Dong Liu;Xiaofeng Wang;Ruidong Han;Jianghua Li;Shanmin Pang","doi":"10.1109/TMM.2025.3632651","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632651","url":null,"abstract":"Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"985-997"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MUVOD: A Novel Multi-View Video Object Segmentation Dataset and a Benchmark for 3D Segmentation
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632697
Bangning Wei;Joshua Maraval;Meriem Outtas;Kidiyo Kpalma;Nicolas Ramin;Lu Zhang
Methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) have steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different source datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation masks in 4D motion, meaning that any object of interest in the scene can be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects under different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.
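As a hedged illustration of how a MUVOD-style sample could be organized in code, the sketch below models one scene as several synchronized views, each with per-frame RGB images and instance-ID masks, so an object can be followed across the frames of a view; all field names are assumptions, not the dataset's actual schema.

```python
# Toy data structures for a multi-view video segmentation sample.
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class ViewSequence:
    frames: List[np.ndarray]        # 30 RGB frames, each (H, W, 3)
    masks: List[np.ndarray]         # per-frame instance-ID masks, each (H, W)

@dataclass
class MultiViewScene:
    scene_id: str
    views: Dict[str, ViewSequence]  # keyed by camera/view name (9 to 46 views)

    def track(self, instance_id: int, view: str) -> List[np.ndarray]:
        """Binary masks of one instance across the temporal frames of one view."""
        return [(m == instance_id).astype(np.uint8) for m in self.views[view].masks]
```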
{"title":"MUVOD: A Novel Multi-View Video Object Segm entation Dataset and a Benchmark for 3D Segmentation","authors":"Bangning Wei;Joshua Maraval;Meriem Outtas;Kidiyo Kpalma;Nicolas Ramin;Lu Zhang","doi":"10.1109/TMM.2025.3632697","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632697","url":null,"abstract":"The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) have steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"726-741"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632627
Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin
The human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of $\left\langle \mathrm{human, action, object} \right\rangle$, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from long-tailed class distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, which consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts at different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., knowledge related to affordances of object instances and knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. These two encoded features of different fields and the knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.
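The snippet below is a hedged sketch of knowledge-prompted HOI scoring: affordance and scene knowledge are looked up as embeddings and concatenated with the visual feature of a human-object pair before action classification. The embedding tables and plain MLP fusion are stand-ins for the paper's ASKG embedding and adaptive fusion modules; every name and size is an assumption.

```python
# Knowledge-prompted action classification for a human-object pair.
import torch
import torch.nn as nn

class KnowledgePromptedHOI(nn.Module):
    def __init__(self, num_objects=80, num_scenes=365, num_actions=117,
                 vis_dim=512, k_dim=128):
        super().__init__()
        self.affordance = nn.Embedding(num_objects, k_dim)  # object-affordance knowledge
        self.scene = nn.Embedding(num_scenes, k_dim)        # scene knowledge
        self.classifier = nn.Sequential(
            nn.Linear(vis_dim + 2 * k_dim, 512), nn.ReLU(), nn.Linear(512, num_actions)
        )

    def forward(self, pair_feat, obj_id, scene_id):
        k = torch.cat([self.affordance(obj_id), self.scene(scene_id)], dim=-1)
        return self.classifier(torch.cat([pair_feat, k], dim=-1))   # action logits

logits = KnowledgePromptedHOI()(torch.randn(4, 512),
                                torch.randint(0, 80, (4,)),
                                torch.randint(0, 365, (4,)))
print(logits.shape)  # torch.Size([4, 117])
```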
{"title":"ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection","authors":"Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin","doi":"10.1109/TMM.2025.3632627","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632627","url":null,"abstract":"Human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objectsby inferring fine-grained triples of <inline-formula><tex-math>$leftlangle rm {{human, action, object}} rightrangle$</tex-math></inline-formula>, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from class long-tailed distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, and consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts on different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features fully absorb the knowledge prompts. These two encoded features of different fields and knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"742-756"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Partition Map-Based Fast Block Partitioning for VVC Inter Coding
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632639
Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.
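The dual-threshold decision scheme can be illustrated with the short sketch below (thresholds and the probability source are assumptions, not the paper's settings): confident predictions take or skip the split directly, and only the uncertain middle band falls back to the encoder's full rate-distortion search.

```python
# Dual-threshold early-termination logic driven by a predicted split probability.
def partition_decision(p_split: float, t_high: float = 0.9, t_low: float = 0.1) -> str:
    if p_split >= t_high:
        return "split"          # trust the prediction, skip the "no split" RD check
    if p_split <= t_low:
        return "no_split"       # terminate early, skip RD checks of sub-partitions
    return "rd_search"          # uncertain: run the encoder's normal RD search

for p in (0.97, 0.55, 0.03):
    print(p, "->", partition_decision(p))
```

Tightening the two thresholds trades encoding-time savings against rate-distortion loss, which is the fine-grained trade-off the abstract refers to.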
{"title":"Partition Map-Based Fast Block Partitioning for VVC Inter Coding","authors":"Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu","doi":"10.1109/TMM.2025.3632639","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632639","url":null,"abstract":"Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"998-1013"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632695
Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun
Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.
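As a hedged sketch of a category-query decoder of the kind described above, the snippet below gives each behavior class a learnable query that cross-attends to the video's spatiotemporal tokens and is then scored; the single attention layer and all dimensions are illustrative assumptions rather than the paper's architecture.

```python
# Learnable per-class queries cross-attend to video tokens and produce class logits.
import torch
import torch.nn as nn

class CategoryQueryDecoder(nn.Module):
    def __init__(self, num_classes=140, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_classes, dim))  # one query per class
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):                        # tokens: (B, N, dim) video features
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)         # each class query reads the video
        return self.score(out).squeeze(-1)            # (B, num_classes) class logits

logits = CategoryQueryDecoder()(torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 140])
```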
{"title":"Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition","authors":"Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun","doi":"10.1109/TMM.2025.3632695","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632695","url":null,"abstract":"Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"878-890"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AesPrompt: Zero-Shot Image Aesthetics Assessment With Multi-Granularity Aesthetic Prompt Learning
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632637
Xiangfei Sheng;Leida Li;Pengfei Chen;Li Cai;Giuseppe Valenzise
Recent years have witnessed increasing interest in image aesthetics assessment (IAA), which predicts the aesthetic appeal of images by simulating human perception. The state-of-the-art IAA methods, despite their significant advancements, typically rely heavily on time-consuming and labor-intensive human annotation of aesthetic scores. Furthermore, they face a generalization challenge, while strong generalization is highly desired in real-world applications. Motivated by this, we investigate zero-shot image aesthetics assessment (ZIAA), which achieves robust model generalization without relying on manual aesthetic annotations and remains largely underexplored. Specifically, a novel aesthetic prompt learning framework for ZIAA, dubbed AesPrompt, is presented in this paper. The key insight of AesPrompt is to emulate the human aesthetic perception process for learning aesthetic-oriented prompts in a multi-granularity manner. First, we develop a new pseudo aesthetic distribution generation paradigm based on a multi-LLM ensemble. Then, external knowledge of multi-granularity prompts encompassing image themes, emotions, and aesthetics is acquired. Through learning the multi-granularity aesthetic-oriented prompts, the proposed method achieves better generalization and interpretability. Extensive experiments on five IAA benchmarks demonstrate that AesPrompt consistently outperforms the state-of-the-art ZIAA methods across diverse-sourced images, covering natural images, artistic images, and artificial intelligence-generated images.
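The sketch below shows the generic zero-shot scoring recipe that prompt-based IAA methods build on: an image embedding is compared with a positive/negative prompt pair and the softmax weight of the positive prompt is read out as the aesthetic score. The encoders are stubbed with random tensors, and AesPrompt's multi-granularity prompts and multi-LLM ensemble are not modeled; everything here is an assumption for illustration.

```python
# Zero-shot aesthetic scoring from an image embedding and two prompt embeddings.
import torch
import torch.nn.functional as F

def zero_shot_aesthetic_score(image_emb, prompt_embs, temperature=0.01):
    """image_emb: (D,); prompt_embs: (2, D) = [positive, negative] prompts."""
    sims = F.normalize(prompt_embs, dim=-1) @ F.normalize(image_emb, dim=-1)
    weights = F.softmax(sims / temperature, dim=0)
    return weights[0].item()         # probability mass on the "good photo" prompt

image_emb = torch.randn(512)         # stand-in for a vision-language image embedding
prompt_embs = torch.randn(2, 512)    # stand-ins for encoded text prompts
print(zero_shot_aesthetic_score(image_emb, prompt_embs))
```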
{"title":"AesPrompt: Zero-Shot Image Aesthetics Assessment With Multi-Granularity Aesthetic Prompt Learning","authors":"Xiangfei Sheng;Leida Li;Pengfei Chen;Li Cai;Giuseppe Valenzise","doi":"10.1109/TMM.2025.3632637","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632637","url":null,"abstract":"Recent years have witnessed increasing interest towards image aesthetics assessment (IAA), which predicts the aesthetic appeal of images by simulating human perception. The state-of-the-art IAA methods, despite their significant advancements, typically rely heavily on time-consuming and labor-intensive human annotation of aesthetic scores. Furthermore, they are subject to the generalization challenge, which is highly desired in real-world applications. Motivated by this, zero-shot image aesthetics assessment (ZIAA) is investigated to achieve robust model generalization without relying on manual aesthetic annotations, which remains largely underexplored. Specifically, a novel aesthetic prompt learning framework for ZIAA, dubbed AesPrompt, is presented in this paper. The key insight of AesPrompt is to emulate the human aesthetic perception process for learning aesthetic-oriented prompts in a multi-granularity manner. First, we first develop a new pseudo aesthetic distribution generation paradigm based on multi-LLM ensemble. Then, external knowledge of multi-granularity prompts encompassing image themes, emotions, and aesthetics is acquired. Through learning the multi-granularity aesthetic-oriented prompts, the proposed method achieves better generalization and interpretability. Extensive experiments on five IAA benchmarks demonstrate that AesPrompt consistently outperforms the state-of-the-art ZIAA methods across diverse-sourced images, covering natural images, artistic images, and artificial intelligence-generated images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"958-971"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Adaptive Multi-Modal Visual Tracking With Dynamic Semantic Prompts
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632650
Jiahao Wang;Fang Liu;Licheng Jiao;Hao Wang;Shuo Li;Lingling Li;Puhua Chen;Xu Liu;Wenping Ma;Xinyi Wang
RGB-based object tracking is a fundamental task in computer vision, aiming to identify, locate, and continuously track objects of interest across sequential video frames. Despite the significant advancements in the performance of traditional RGB trackers, they still face challenges in maintaining accuracy and robustness in the presence of complex backgrounds, occlusions, and rapid movements. To tackle these challenges, combining visual auxiliary modalities has gained significant attention. Beyond this, integrating natural language information offers additional advantages by providing high-level semantic context, enhancing robustness, and clarifying target priorities, further elevating tracker performance. This work proposes the Adaptive Multi-modal Visual Tracking with Dynamic Semantic Prompts (AMVTrack) tracker, which efficiently incorporates image descriptions and avoids text dependency during tracking to improve flexibility and adaptability. AMVTrack significantly reduces computational resource consumption by freezing the parameters of the image encoder, text encoder, and Box Head and optimizing only a few learnable prompt parameters. Additionally, we introduce the Adaptive Dynamic Semantic Prompt Generator (ADSPG), which dynamically generates semantic prompts based on visual features, and the Visual-Language Fusion Adaptation (V-L FA) method, which integrates multi-modal features to ensure consistency and complementarity of information. Furthermore, we partition the Image Encoder to conduct an in-depth investigation into the importance of features across different depth and width regions. Experimental results demonstrate that AMVTrack achieves significant performance improvements on multiple benchmark datasets, proving its effectiveness and robustness in complex scenarios.
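As a hedged sketch of the parameter-efficient setup the abstract describes, the snippet below freezes a transformer encoder and optimizes only a small set of learnable prompt tokens that are prepended to the visual tokens; the module choices and sizes are assumptions, not AMVTrack's actual components.

```python
# Prompt tuning: backbone parameters are frozen, only prompt tokens get gradients.
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=4
)
prompts = nn.Parameter(torch.randn(1, 8, 256))           # 8 learnable prompt tokens

for p in backbone.parameters():                           # freeze the heavy encoder
    p.requires_grad = False

optimizer = torch.optim.AdamW([prompts], lr=1e-4)         # only the prompts are optimized

tokens = torch.randn(2, 196, 256)                         # image tokens from a frozen patch embed
x = torch.cat([prompts.expand(tokens.size(0), -1, -1), tokens], dim=1)
out = backbone(x)                                          # prompts interact with visual tokens
print(out.shape)  # torch.Size([2, 204, 256])
```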
{"title":"Adaptive Multi-Modal Visual Tracking With Dynamic Semantic Prompts","authors":"Jiahao Wang;Fang Liu;Licheng Jiao;Hao Wang;Shuo Li;Lingling Li;Puhua Chen;Xu Liu;Wenping Ma;Xinyi Wang","doi":"10.1109/TMM.2025.3632650","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632650","url":null,"abstract":"RGB-based object tracking is a fundamental task in computer vision, aiming to identify, locate, and continuously track objects of interest across sequential video frames. Despite the significant advancements in the performance of traditional RGB trackers, they still face challenges in maintaining accuracy and robustness in the presence of complex backgrounds, occlusions, and rapid movements. To tackle these challenges, combining visual auxiliary modalities has gained significant attention. Beyond this, integrating natural language information offers additional advantages by providing high-level semantic context, enhancing robustness, and clarifying target priorities, further elevating tracker performance. This work proposes the <bold>A</b>daptive <bold>M</b>ulti-modal <bold>V</b>isual Tracking with Dynamic Semantic Prompts (<bold>AMVTrack</b>) tracker, which efficiently incorporates image descriptions and avoids text dependency during tracking to improve flexibility and adaptability. AMVTrack significantly reduces computational resource consumption by freezing the parameters of the image encoder, text encoder, and Box Head and only optimizing a few learnable prompt parameters. Additionally, we introduce the Adaptive Dynamic Semantic Prompt Generator (ADSPG), which dynamically generates semantic prompts based on visual features, and the <bold>V</b>isual-<bold>L</b>anguage <bold>F</b>usion <bold>A</b>daptation (<bold>V-L FA</b>) method, which integrates multi-modal features to ensure consistency and complementarity of information. Additionally, we partition the Image Encoder to conduct an in-depth investigation into the relationship between the importance of features across different depth and width regions. Experimental results demonstrate that AMVTrack achieves significant performance improvements on multiple benchmark datasets, proving its effectiveness and robustness in complex scenarios.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"915-928"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0