
Latest Publications in IEEE Transactions on Multimedia

Purified Zero-Shot Sketch-Based Image Retrieval
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632682
Yang Zhou;Jingru Yang;Jin Wang;Kaixiang Huang;Guodong Lu;Shengfeng He
Sketches, an emerging alternative to natural language in multimedia systems, are characterized by sparse visual cues such as simple strokes, which differ significantly from natural images containing complex elements such as background, foreground, and texture. This misalignment poses substantial challenges for zero-shot sketch-based image retrieval (ZS-SBIR). Prior approaches match sketches to full images and tend to overlook the redundant elements in natural images, leading to model distraction and semantic ambiguity. To address this issue, we introduce a distraction-agnostic framework, purified cross-domain matching (PuXIM), which operates on a straightforward principle: masking and matching. We devise a visual-cross-linguistic (VxL) sampler that generates linguistic masks based on semantic labels to obscure semantically irrelevant image features. Our novel contribution is the concept of purified masked matching (PMM), which comprises two processes: (1) reconstruction, which compels the image encoder to reconstruct the masked image feature, and (2) interaction, which involves a transformer decoder that processes both sketch and masked image features to investigate cross-domain relationships for effective matching. Evaluated on the TU-Berlin, Sketchy, and QuickDraw datasets, PuXIM sets new performance benchmarks. Importantly, the distraction-agnostic nature of the matching process makes PuXIM easier to train, enabling efficient adaptation to zero-shot scenarios even with less, and lower-quality, training data.
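The masking-and-matching principle can be illustrated with a toy sketch in NumPy. This is not the authors' implementation: the linguistic mask is approximated here by a hypothetical binary relevance vector, and features are plain vectors rather than encoder outputs. The point is only that masking clutter channels before a cosine comparison removes the distraction a full-image match suffers from.

```python
import numpy as np

def masked_match_score(sketch_feat, image_feat, relevance_mask):
    """Cosine similarity between a sketch feature and an image feature
    whose semantically irrelevant channels have been masked out."""
    purified = image_feat * relevance_mask  # suppress distracting channels
    num = float(np.dot(sketch_feat, purified))
    den = float(np.linalg.norm(sketch_feat) * np.linalg.norm(purified)) + 1e-8
    return num / den

# Toy features: the sketch carries only object strokes (channels 0..3);
# the image feature adds background/texture clutter in channels 4..7.
sketch = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
image = np.array([1.0, 1.0, 1.0, 1.0, 5.0, -3.0, 2.0, 4.0])
mask = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # label-derived relevance (hypothetical)

score_masked = masked_match_score(sketch, image, mask)      # purified match
score_full = masked_match_score(sketch, image, np.ones(8))  # naive full-image match
```

With the clutter masked out, the purified match is near-perfect, while the full-image match is dragged down by channels the sketch cannot express.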
IEEE Transactions on Multimedia, vol. 28, pp. 929–943, 2025.
Citations: 0
Real-Scene Image Dehazing via Laplacian Pyramid-Based Conditional Diffusion Model
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632694
Yongzhen Wang;Jie Sun;Heng Liu;Xiao-Ping Zhang;Mingqiang Wei
Recent diffusion models have demonstrated exceptional efficacy across various image restoration tasks, but still suffer from slow inference and substantial computational cost. To address these challenges, we present LPCDiff, a novel Laplacian Pyramid-based Conditional Diffusion model designed for real-scene image dehazing. LPCDiff leverages the Laplacian pyramid decomposition to decouple the input image into two components: the low-resolution low-pass image and the high-frequency residuals. These components are subsequently reconstructed through a diffusion model and a well-designed high-frequency residual recovery module. With such a strategy, LPCDiff can substantially accelerate inference speed and reduce computational costs without sacrificing image fidelity. In addition, the framework empowers the model to capture intrinsic high-frequency details and low-frequency structural information within the image, resulting in sharper and more realistic haze-free outputs. Moreover, to extract more valuable information from the limited training data, we introduce a low-frequency refinement module to further enhance the intricate details of the final dehazed images. Through extensive experimentation, our method significantly outperforms 12 state-of-the-art approaches on three real-world and one synthetic image dehazing benchmarks.
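The Laplacian pyramid decomposition at the core of this design can be sketched as follows. Note this is a simplified stand-in, not the paper's code: classical Laplacian pyramids use Gaussian filtering for the down/up steps, whereas this sketch substitutes 2×2 average pooling and nearest-neighbour upsampling so the round trip is exactly invertible. The key property is the same: a cheap low-resolution base (where the diffusion model would operate) plus high-frequency residuals that restore the full-resolution image losslessly.

```python
import numpy as np

def downsample(img):
    """2x2 average pooling (stand-in for the low-pass pyramid step)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour 2x upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_decompose(img, levels=2):
    """Split an image into a coarse low-pass base plus per-level
    high-frequency residuals, as in a Laplacian pyramid."""
    residuals = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        residuals.append(current - upsample(low))  # what the low-pass step lost
        current = low
    return current, residuals

def laplacian_reconstruct(base, residuals):
    """Invert the decomposition: upsample and add residuals back, fine-to-coarse."""
    current = base
    for res in reversed(residuals):
        current = upsample(current) + res
    return current

img = np.arange(64, dtype=float).reshape(8, 8)
base, residuals = laplacian_decompose(img, levels=2)  # base is 2x2
recon = laplacian_reconstruct(base, residuals)
```

Because the residuals record exactly what each downsampling step discarded, reconstruction is exact; in LPCDiff only the small base goes through the expensive diffusion process, which is where the speedup comes from.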
IEEE Transactions on Multimedia, vol. 28, pp. 944–957, 2025.
Citations: 0
Generic-to-Personalised Learning for Multimodal Image Synthesis With Bidirectional Variational GAN
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632663
Long Chen;Xirui Dong;Jiangrong Shen;Lu Zhang;Qi Xu;Gang Pan;Qiang Zhang
Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain; however, unidirectional models heavily rely on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number of paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness on the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patients, enhancing image details. Finally, the GTP learning strategy along with VAS enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.
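The variational structure (VAS) the abstract mentions follows the standard VAE recipe: a reparameterised sample from the latent distribution plus a KL penalty that pulls it toward a standard normal, which is what "imposes smoothness" on the latent space. A minimal NumPy sketch of those two ingredients (illustrative only, not the BVGAN generator):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps; in a real autograd framework this
    keeps the sampling step differentiable w.r.t. mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ): the smoothness penalty on the latent space."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

rng = np.random.default_rng(0)
mu = np.zeros(4)
log_var = np.zeros(4)                  # sigma = 1: already a standard normal
z = reparameterize(mu, log_var, rng)

kl_zero = kl_to_standard_normal(mu, log_var)          # matched distributions
kl_shifted = kl_to_standard_normal(mu + 1.0, log_var) # mean pushed away from 0
```

The penalty is zero exactly when the latent matches N(0, I) and grows as the encoder's posterior drifts, discouraging the scattered, noisy latent codes an unregularised GAN generator can produce.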
IEEE Transactions on Multimedia, vol. 28, pp. 902–914, 2025.
Citations: 0
Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632652
Fang Peng;Xiaoshan Yang;Yaowei Wang;Changsheng Xu
Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress has been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantic-aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.
IEEE Transactions on Multimedia, vol. 28, pp. 853–864, 2025.
Citations: 0
Cas-OVD: Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632649
Zhenyu Fang;Yulong Wu;Jinchang Ren;Jiangbin Zheng;Yijun Yan;Lixiang Zhang
Although text information has aided existing models to achieve promising results in open vocabulary object detection (OVD), the lack of semantic information makes small object detection (SOD) difficult. Moreover, this semantic gap also causes failures when matching text to image features, resulting in false negatives. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce the missed and false detections of small objects. Meanwhile, a deformable convolution network based feature conversion module is proposed to enhance the semantic information of small objects, even potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This can progressively enhance the feature correlation between the image regions and the textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieved 17.95% AP$_{\mathrm{all}}$ and 14.6% AP$_{\mathrm{s}}$, outperforming RegionCLIP by 3.5% AP$_{\mathrm{all}}$ and 3.0% AP$_{\mathrm{s}}$, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% AP$_{\mathrm{all}}$ and 17.26% AP$_{\mathrm{s}}$, surpassing RegionCLIP by 6.6% AP$_{\mathrm{all}}$ and 6.1% AP$_{\mathrm{s}}$, respectively.
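The AP$_{\mathrm{all}}$ and AP$_{\mathrm{s}}$ numbers quoted here are average-precision scores; for readers unfamiliar with the metric, a minimal sketch of how AP for one class is computed from confidence-ranked detections follows. This is a generic step-integrated precision-recall area, not the exact COCO-style protocol (which additionally sweeps IoU thresholds and interpolates precision).

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one class,
    computed from confidence-ranked detections."""
    order = np.argsort(-np.asarray(scores, dtype=float))     # rank by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # integrate precision over recall increments (step function)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# 3 ground-truth boxes; 4 detections ranked by confidence,
# of which the 1st, 3rd and 4th match a ground-truth box.
ap = average_precision([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 1], num_gt=3)
```

AP$_{\mathrm{s}}$ restricts this computation to small ground-truth objects, which is why it is the figure of merit for a small-object detector like Cas-OVD.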
IEEE Transactions on Multimedia, vol. 28, pp. 757–771, 2025.
Citations: 0
MDT-FI: Mask-Guided Dual-Branch Transformer With Texture and Structure Feature Interaction for Image Inpainting
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632651
Dong Liu;Xiaofeng Wang;Ruidong Han;Jianghua Li;Shanmin Pang
Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.
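The Spectral Harmonization Module (SHM) mentioned above fuses frequency-domain information; the basic operation behind such modules is splitting an image into low- and high-frequency components in the Fourier domain. A minimal NumPy sketch (a generic hard-mask split, not the paper's SHM, whose exact design is not given in the abstract):

```python
import numpy as np

def frequency_split(img, cutoff):
    """Separate an image into low- and high-frequency components
    using a hard circular mask in the (shifted) Fourier domain."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)   # distance from DC component
    low_mask = dist <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(spec * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * ~low_mask)).real
    return low, high

# A smooth horizontal sine pattern as a toy "image".
img = np.outer(np.sin(np.linspace(0.0, 2.0 * np.pi, 32)), np.ones(32))
low, high = frequency_split(img, cutoff=4)
recon = low + high
```

Because the two masks partition the spectrum, the components sum back to the original image exactly; a learned module can then process structure (low) and texture (high) on separate branches before recombining them.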
IEEE Transactions on Multimedia, vol. 28, pp. 985–997, 2025.
Citations: 0
MUVOD: A Novel Multi-View Video Object Segmentation Dataset and a Benchmark for 3D Segmentation
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632697
Bangning Wei;Joshua Maraval;Meriem Outtas;Kidiyo Kpalma;Nicolas Ramin;Lu Zhang
The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different source datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation masks in 4D motion, meaning that any object of interest in the scene can be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects under different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.
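Benchmarks of this kind typically score predicted masks against ground truth with region similarity, i.e. the Jaccard index (mask IoU). The abstract does not spell out MUVOD's metric, so the following is a generic sketch of that standard score, not necessarily the paper's exact protocol:

```python
import numpy as np

def mask_iou(pred, gt):
    """Region similarity (Jaccard index) between two binary masks:
    |intersection| / |union|, the standard video object segmentation score."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return float(np.logical_and(pred, gt).sum()) / float(union)

gt = np.zeros((4, 4), dtype=int)
gt[1:3, 1:3] = 1          # 2x2 ground-truth object
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:4] = 1        # over-segmented 2x3 prediction
iou = mask_iou(pred, gt)
```

For a multi-view dataset the score is averaged over frames, views, and object instances, so an object must be segmented consistently across both time and viewpoints to score well.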
IEEE Transactions on Multimedia, vol. 28, pp. 726–741, 2025.
Citations: 0
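The abstract above mentions an evaluation metric for multi-view video object segmentation without spelling it out. A common starting point for such metrics is per-object mask IoU averaged over (view, frame) pairs; the sketch below implements that generic aggregation. The function names and the averaging scheme are illustrative assumptions, not MUVOD's actual metric definition.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks; empty-vs-empty counts as a perfect match."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def mean_iou_over_views(preds, gts) -> float:
    """Average IoU for one tracked object across its (view, frame) mask pairs."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return float(np.mean(scores))
```

A dataset-level score would then average this per-object value over all 459 annotated instances.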
ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-14 DOI: 10.1109/TMM.2025.3632627
Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin
Human-object interaction (HOI) detection aims to learn how humans interact with surrounding objects by inferring fine-grained triples of $\left\langle \mathrm{human, action, object} \right\rangle$, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from long-tailed class distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, which consequently limits their potential in real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts at different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., knowledge related to affordances of object instances and knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. These two encoded features of different fields and the knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.
{"title":"ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection","authors":"Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin","doi":"10.1109/TMM.2025.3632627","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632627","url":null,"abstract":"Human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of <inline-formula><tex-math>$\left\langle \mathrm{human, action, object} \right\rangle$</tex-math></inline-formula>, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from class long-tailed distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, and consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts on different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features fully absorb the knowledge prompts. These two encoded features of different fields and knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. 
Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"742-756"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
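The ASKG-based prior knowledge embedding and the adaptive fusion modules are described above only at a high level. A minimal way to let visual features "absorb" knowledge prompts is cross-attention from visual tokens to knowledge embeddings; the NumPy sketch below shows that generic mechanism. The single-head attention form, weight shapes, and residual fusion are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_fusion(visual, knowledge, W_q, W_k, W_v):
    """Visual tokens attend to knowledge embeddings (single-head cross-attention).

    visual: (n_tok, d) image-side tokens; knowledge: (n_kg, d) graph embeddings.
    Returns visual tokens with knowledge prompts added residually.
    """
    q = visual @ W_q                      # queries from visual tokens
    k = knowledge @ W_k                   # keys from knowledge embeddings
    v = knowledge @ W_v                   # values from knowledge embeddings
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return visual + attn @ v              # residual injection of knowledge
```

In the paper's terms, one such fusion could run at the instance-field granularity and another at the scene-field granularity before HOI recognition.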
Partition Map-Based Fast Block Partitioning for VVC Inter Coding
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-14 DOI: 10.1109/TMM.2025.3632639
Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.
{"title":"Partition Map-Based Fast Block Partitioning for VVC Inter Coding","authors":"Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu","doi":"10.1109/TMM.2025.3632639","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632639","url":null,"abstract":"Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. 
The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"998-1013"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
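The dual-threshold decision scheme described above can be sketched as a three-way branch on the network's predicted split probability: confident predictions bypass rate-distortion search in one direction or the other, and only uncertain blocks fall back to full RDO. The threshold values and return labels below are illustrative placeholders, not the paper's calibrated settings.

```python
def partition_decision(p_split: float, t_low: float, t_high: float) -> str:
    """Dual-threshold early decision for one block.

    p_split: network-predicted probability that the block should be split.
    Moving t_low/t_high apart trades more complexity reduction for a larger
    risk of rate-distortion loss, and vice versa.
    """
    if p_split >= t_high:
        return "split"            # trust the network: skip the no-split RD check
    if p_split <= t_low:
        return "no_split"         # early-terminate the recursive partition search
    return "full_rd_search"       # uncertain region: exhaustive RD optimization
```

Sweeping the two thresholds yields the fine-grained complexity/BD-rate trade-off curve the abstract refers to.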
Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-14 DOI: 10.1109/TMM.2025.3632695
Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun
Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.
{"title":"Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition","authors":"Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun","doi":"10.1109/TMM.2025.3632695","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632695","url":null,"abstract":"Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. 
Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"878-890"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
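The Category Query Decoder above maps learnable per-category query vectors to behavior logits. A minimal NumPy sketch of that idea follows: each category query attends over per-frame features and scores the pooled result against itself. The attention form and the dot-product scoring are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def category_query_decode(video_feats: np.ndarray, cat_queries: np.ndarray) -> np.ndarray:
    """Explicit category modeling via learnable query vectors.

    video_feats: (T, d) per-frame features; cat_queries: (C, d) learnable
    vectors, one per behavior category. Returns per-category logits (C,).
    """
    scale = np.sqrt(video_feats.shape[-1])
    attn = softmax(cat_queries @ video_feats.T / scale, axis=-1)  # (C, T)
    pooled = attn @ video_feats                # category-specific temporal pooling
    return np.einsum("cd,cd->c", pooled, cat_queries)  # one logit per category
```

Because the queries are category-specific rather than species-specific, the same decoder can score human and animal behaviors without species labels.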
Journal
IEEE Transactions on Multimedia