Pub Date: 2025-11-14. DOI: 10.1109/TMM.2025.3632663
Long Chen;Xirui Dong;Jiangrong Shen;Lu Zhang;Qi Xu;Gang Pan;Qiang Zhang
Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain; however, unidirectional models rely heavily on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number of paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness on the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patients, enhancing image details. Finally, the GTP learning strategy, along with VAS, enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.
{"title":"Generic-to-Personalised Learning for Multimodal Image Synthesis With Bidirectional Variational GAN","authors":"Long Chen;Xirui Dong;Jiangrong Shen;Lu Zhang;Qi Xu;Gang Pan;Qiang Zhang","doi":"10.1109/TMM.2025.3632663","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632663","url":null,"abstract":"Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain, however, unidirectional models heavily rely on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness to the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on a large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patient, enhancing image details. Finally, the GTP learning strategy along with VAS enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"902-914"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14. DOI: 10.1109/TMM.2025.3632652
Fang Peng;Xiaoshan Yang;Yaowei Wang;Changsheng Xu
Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress has been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantic-aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.
{"title":"Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition","authors":"Fang Peng;Xiaoshan Yang;Yaowei Wang;Changsheng Xu","doi":"10.1109/TMM.2025.3632652","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632652","url":null,"abstract":"Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress have been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantic-aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"853-864"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although text information has helped existing models achieve promising results in open-vocabulary object detection (OVD), the lack of semantic information makes small object detection (SOD) difficult. Moreover, this semantic gap also causes failures when matching text and image features, resulting in false negative detections. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce missed and false detections of small objects. Meanwhile, a deformable-convolution-based feature conversion module is proposed to enhance the semantic information of small objects, even potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This progressively enhances the feature correlation between image regions and textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieves 17.95% AP$_{\mathrm{all}}$ and 14.6% AP$_{\mathrm{s}}$, outperforming RegionCLIP by 3.5% AP$_{\mathrm{all}}$ and 3.0% AP$_{\mathrm{s}}$, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% AP$_{\mathrm{all}}$ and 17.26% AP$_{\mathrm{s}}$, surpassing RegionCLIP by 6.6% AP$_{\mathrm{all}}$ and 6.1% AP$_{\mathrm{s}}$, respectively.
{"title":"Cas-OVD: Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving","authors":"Zhenyu Fang;Yulong Wu;Jinchang Ren;Jiangbin Zheng;Yijun Yan;Lixiang Zhang","doi":"10.1109/TMM.2025.3632649","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632649","url":null,"abstract":"Although text information has aided existing models to achieve promising results in open vocabulary object detection (OVD), the lack of semantic information has led to the difficulty in small objects detection (SOD). Moreover, such semantic gap also causes failure when matching texts and image features, resulting in false negative instances being detected. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce the missing and false detections of small objects. Meanwhile, a deformable convolution network based feature conversion module is proposed to enhance the semantic information of small objects even the potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This can progressively enhance the feature correlation between the image regions and the textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieved 17.95% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 14.6% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, outperforming RegionCLIP by 3.5% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 3.0% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, respectively. On the OV_COCO dataset, Cas-OVD has the 32.71% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 17.26% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, surpassing the RegionCLIP by 6.6% AP<inline-formula><tex-math>$_{mathrm{all}}$</tex-math></inline-formula> and 6.1% AP<inline-formula><tex-math>$_{mathrm{s}}$</tex-math></inline-formula>, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"757-771"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14. DOI: 10.1109/TMM.2025.3632651
Dong Liu;Xiaofeng Wang;Ruidong Han;Jianghua Li;Shanmin Pang
Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.
{"title":"MDT-FI: Mask-Guided Dual-Branch Transformer With Texture and Structure Feature Interaction for Image Inpainting","authors":"Dong Liu;Xiaofeng Wang;Ruidong Han;Jianghua Li;Shanmin Pang","doi":"10.1109/TMM.2025.3632651","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632651","url":null,"abstract":"Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"985-997"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) have steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different source datasets captured with various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation masks in 4D motion, meaning that any object of interest in the scene can be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects under different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.
{"title":"MUVOD: A Novel Multi-View Video Object Segm entation Dataset and a Benchmark for 3D Segmentation","authors":"Bangning Wei;Joshua Maraval;Meriem Outtas;Kidiyo Kpalma;Nicolas Ramin;Lu Zhang","doi":"10.1109/TMM.2025.3632697","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632697","url":null,"abstract":"The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) have steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"726-741"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of $\left\langle \text{human, action, object} \right\rangle$, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from class long-tailed distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, which consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts at different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., knowledge related to the affordances of object instances and knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. These two encoded features of different fields and the knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both the HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with state-of-the-art methods.
{"title":"ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection","authors":"Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin","doi":"10.1109/TMM.2025.3632627","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632627","url":null,"abstract":"Human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objectsby inferring fine-grained triples of <inline-formula><tex-math>$leftlangle rm {{human, action, object}} rightrangle$</tex-math></inline-formula>, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from class long-tailed distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, and consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts on different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features fully absorb the knowledge prompts. These two encoded features of different fields and knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"742-756"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14. DOI: 10.1109/TMM.2025.3632639
Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.
{"title":"Partition Map-Based Fast Block Partitioning for VVC Inter Coding","authors":"Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu","doi":"10.1109/TMM.2025.3632639","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632639","url":null,"abstract":"Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"998-1013"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-14. DOI: 10.1109/TMM.2025.3632695
Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun
Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.
{"title":"Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition","authors":"Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun","doi":"10.1109/TMM.2025.3632695","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632695","url":null,"abstract":"Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"878-890"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent years have witnessed increasing interest in image aesthetics assessment (IAA), which predicts the aesthetic appeal of images by simulating human perception. The state-of-the-art IAA methods, despite their significant advancements, typically rely heavily on time-consuming and labor-intensive human annotation of aesthetic scores. Furthermore, they face a generalization challenge, whereas strong generalization is highly desired in real-world applications. Motivated by this, zero-shot image aesthetics assessment (ZIAA) is investigated to achieve robust model generalization without relying on manual aesthetic annotations, a setting that remains largely underexplored. Specifically, a novel aesthetic prompt learning framework for ZIAA, dubbed AesPrompt, is presented in this paper. The key insight of AesPrompt is to emulate the human aesthetic perception process by learning aesthetic-oriented prompts in a multi-granularity manner. First, we develop a new pseudo aesthetic distribution generation paradigm based on a multi-LLM ensemble. Then, external knowledge in the form of multi-granularity prompts encompassing image themes, emotions, and aesthetics is acquired. Through learning the multi-granularity aesthetic-oriented prompts, the proposed method achieves better generalization and interpretability. Extensive experiments on five IAA benchmarks demonstrate that AesPrompt consistently outperforms state-of-the-art ZIAA methods across diverse-sourced images, covering natural images, artistic images, and artificial-intelligence-generated images.
{"title":"AesPrompt: Zero-Shot Image Aesthetics Assessment With Multi-Granularity Aesthetic Prompt Learning","authors":"Xiangfei Sheng;Leida Li;Pengfei Chen;Li Cai;Giuseppe Valenzise","doi":"10.1109/TMM.2025.3632637","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632637","url":null,"abstract":"Recent years have witnessed increasing interest towards image aesthetics assessment (IAA), which predicts the aesthetic appeal of images by simulating human perception. The state-of-the-art IAA methods, despite their significant advancements, typically rely heavily on time-consuming and labor-intensive human annotation of aesthetic scores. Furthermore, they are subject to the generalization challenge, which is highly desired in real-world applications. Motivated by this, zero-shot image aesthetics assessment (ZIAA) is investigated to achieve robust model generalization without relying on manual aesthetic annotations, which remains largely underexplored. Specifically, a novel aesthetic prompt learning framework for ZIAA, dubbed AesPrompt, is presented in this paper. The key insight of AesPrompt is to emulate the human aesthetic perception process for learning aesthetic-oriented prompts in a multi-granularity manner. First, we first develop a new pseudo aesthetic distribution generation paradigm based on multi-LLM ensemble. Then, external knowledge of multi-granularity prompts encompassing image themes, emotions, and aesthetics is acquired. Through learning the multi-granularity aesthetic-oriented prompts, the proposed method achieves better generalization and interpretability. Extensive experiments on five IAA benchmarks demonstrate that AesPrompt consistently outperforms the state-of-the-art ZIAA methods across diverse-sourced images, covering natural images, artistic images, and artificial intelligence-generated images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"958-971"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RGB-based object tracking is a fundamental task in computer vision, aiming to identify, locate, and continuously track objects of interest across sequential video frames. Despite the significant advancements in the performance of traditional RGB trackers, they still face challenges in maintaining accuracy and robustness in the presence of complex backgrounds, occlusions, and rapid movements. To tackle these challenges, incorporating auxiliary visual modalities has gained significant attention. Beyond this, integrating natural language information offers additional advantages by providing high-level semantic context, enhancing robustness, and clarifying target priorities, further elevating tracker performance. This work proposes the Adaptive Multi-modal Visual Tracking with Dynamic Semantic Prompts (AMVTrack) tracker, which efficiently incorporates image descriptions and avoids text dependency during tracking to improve flexibility and adaptability. AMVTrack significantly reduces computational resource consumption by freezing the parameters of the image encoder, text encoder, and Box Head and only optimizing a few learnable prompt parameters. Additionally, we introduce the Adaptive Dynamic Semantic Prompt Generator (ADSPG), which dynamically generates semantic prompts based on visual features, and the Visual-Language Fusion Adaptation (V-L FA) method, which integrates multi-modal features to ensure consistency and complementarity of information. Furthermore, we partition the Image Encoder to investigate how feature importance varies across regions of different depth and width. Experimental results demonstrate that AMVTrack achieves significant performance improvements on multiple benchmark datasets, proving its effectiveness and robustness in complex scenarios.
{"title":"Adaptive Multi-Modal Visual Tracking With Dynamic Semantic Prompts","authors":"Jiahao Wang;Fang Liu;Licheng Jiao;Hao Wang;Shuo Li;Lingling Li;Puhua Chen;Xu Liu;Wenping Ma;Xinyi Wang","doi":"10.1109/TMM.2025.3632650","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632650","url":null,"abstract":"RGB-based object tracking is a fundamental task in computer vision, aiming to identify, locate, and continuously track objects of interest across sequential video frames. Despite the significant advancements in the performance of traditional RGB trackers, they still face challenges in maintaining accuracy and robustness in the presence of complex backgrounds, occlusions, and rapid movements. To tackle these challenges, combining visual auxiliary modalities has gained significant attention. Beyond this, integrating natural language information offers additional advantages by providing high-level semantic context, enhancing robustness, and clarifying target priorities, further elevating tracker performance. This work proposes the <bold>A</b>daptive <bold>M</b>ulti-modal <bold>V</b>isual Tracking with Dynamic Semantic Prompts (<bold>AMVTrack</b>) tracker, which efficiently incorporates image descriptions and avoids text dependency during tracking to improve flexibility and adaptability. AMVTrack significantly reduces computational resource consumption by freezing the parameters of the image encoder, text encoder, and Box Head and only optimizing a few learnable prompt parameters. Additionally, we introduce the Adaptive Dynamic Semantic Prompt Generator (ADSPG), which dynamically generates semantic prompts based on visual features, and the <bold>V</b>isual-<bold>L</b>anguage <bold>F</b>usion <bold>A</b>daptation (<bold>V-L FA</b>) method, which integrates multi-modal features to ensure consistency and complementarity of information. Additionally, we partition the Image Encoder to conduct an in-depth investigation into the relationship between the importance of features across different depth and width regions. Experimental results demonstrate that AMVTrack achieves significant performance improvements on multiple benchmark datasets, proving its effectiveness and robustness in complex scenarios.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"915-928"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}