Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521773
Mengzan Qi;Sixian Chan;Chen Hang;Guixu Zhang;Tieyong Zeng;Zhi Li
Visible-Infrared Person Re-identification (VI-ReID) aims to retrieve images of specific identities across modalities. To relieve the large cross-modality discrepancy, researchers introduce an auxiliary modality within the image space to assist modality-invariant representation learning. However, the inherent quality of the generated auxiliary images is difficult to constrain, which leads to a bottleneck in retrieval performance. In this paper, we propose a novel Auxiliary Representation Guided Network (ARGN) to explore the potential of auxiliary representations, which are generated directly within the modality-shared embedding space. In contrast to the original visible and infrared representations, which contain information solely from their respective modalities, these auxiliary representations integrate cross-modality information by fusing both modalities. In our framework, we utilize these auxiliary representations as modality guidance to reduce the cross-modality discrepancy. First, we propose a High-quality Auxiliary Representation Learning (HARL) framework to generate identity-consistent auxiliary representations. The primary objective of HARL is to ensure that auxiliary representations capture diverse modality information from both modalities while preserving identity-related discrimination. Second, guided by these auxiliary representations, we design an Auxiliary Representation Guided Constraint (ARGC) to optimize the modality-shared embedding space. This constraint enhances intra-identity compactness and inter-identity separability, further improving retrieval performance. In addition, to improve the robustness of our framework against modality variation, we introduce a Part-based Adaptive Gaussian Module (PAGM) to adaptively extract discriminative information across modalities. Finally, extensive experiments demonstrate the superiority of our method over state-of-the-art approaches on three VI-ReID datasets.
{"title":"Auxiliary Representation Guided Network for Visible-Infrared Person Re-Identification","authors":"Mengzan Qi;Sixian Chan;Chen Hang;Guixu Zhang;Tieyong Zeng;Zhi Li","doi":"10.1109/TMM.2024.3521773","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521773","url":null,"abstract":"Visible-Infrared Person Re-identification aims to retrieve images of specific identities across modalities. To relieve the large cross-modality discrepancy, researchers introduce the auxiliary modality within the image space to assist modality-invariant representation learning. However, the challenge persists in constraining the inherent quality of generated auxiliary images, further leading to a bottleneck in retrieval performance. In this paper, we propose a novel Auxiliary Representation Guided Network (ARGN) to explore the potential of auxiliary representations, which are directly generated within the modality-shared embedding space. In contrast to the original visible and infrared representations, which contain information solely from their respective modalities, these auxiliary representations integrate cross-modality information by fusing both modalities. In our framework, we utilize these auxiliary representations as modality guidance to reduce the cross-modality discrepancy. First, we propose a High-quality Auxiliary Representation Learning (HARL) framework to generate identity-consistent auxiliary representations. The primary objective of our HARL is to ensure that auxiliary representations capture diverse modality information from both modalities while concurrently preserving identity-related discrimination. Second, guided by these auxiliary representations, we design an Auxiliary Representation Guided Constraint (ARGC) to optimize the modality-shared embedding space. By incorporating this constraint, the modality-shared embedding space is optimized to achieve enhanced intra-identity compactness and inter-identity separability, further improving the retrieval performance. In addition, to improve the robustness of our framework against the modality variation, we introduce a Part-based Adaptive Gaussian Module (PAGM) to adaptively extract discriminative information across modalities. Finally, extensive experiments are conducted to demonstrate the superiority of our method over state-of-the-art approaches on three VI-ReID datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"340-355"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521852
Lingtong Min;Ziman Fan;Shunzhou Wang;Feiyang Dou;Xin Li;Binglu Wang
Compositional Zero-Shot Learning (CZSL) aims to learn visual concepts (i.e., attributes and objects) from seen compositions and combine them to predict unseen compositions. Existing CZSL methods typically encode image features with traditional visual encoders (i.e., CNNs and Transformers) or with the image encoders of Visual-Language Models (VLMs). However, traditional visual encoders lack multi-modal textual information, and the image encoders of VLMs depend heavily on their pre-training data, making each less effective when used independently for predicting unseen compositions. To overcome this limitation, we propose a novel approach that jointly models traditional visual encoders and VLM visual encoders to enhance the prediction of uncommon and unseen compositions. Specifically, we design an adaptive fusion module that automatically adjusts the weights of the similarity scores produced by the traditional and VLM branches during training; these weights are inherited during inference. Given the significance of disentangling attributes and objects, we design a Multi-Attribute Object Module that, during training, incorporates multiple pairs of attributes and objects as prior knowledge, leveraging this rich prior to facilitate the disentanglement of attributes and objects. Building upon this, we select the text encoder from VLMs to construct the Adaptive Fusion Network. We conduct extensive experiments on the Clothing16K, UT-Zappos50K, and C-GQA datasets, achieving excellent performance on Clothing16K and UT-Zappos50K.
{"title":"Adaptive Fusion Learning for Compositional Zero-Shot Recognition","authors":"Lingtong Min;Ziman Fan;Shunzhou Wang;Feiyang Dou;Xin Li;Binglu Wang","doi":"10.1109/TMM.2024.3521852","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521852","url":null,"abstract":"Compositional Zero-Shot Learning (CZSL) aims to learn visual concepts (i.e., attributes and objects) from seen compositions and combine them to predict unseen compositions. Existing visual encoders in CZSL typically use traditional visual encoders (i.e., CNN and Transformer) or image encoders from Visual-Language Models (VLMs) to encode image features. However, traditional visual encoders need more multi-modal textual information, and image encoders of VLMs exhibit dependence on pre-training data, making them less effective when used independently for predicting unseen compositions. To overcome this limitation, we propose a novel approach based on the joint modeling of traditional visual encoders and VLMs visual encoders to enhance the prediction ability for uncommon and unseen compositions. Specifically, we design an adaptive fusion module that automatically adjusts the weighted parameters of similarity scores between traditional and VLMs methods during training, and these weighted parameters are inherited during the inference process. Given the significance of disentangling attributes and objects, we design a Multi-Attribute Object Module that, during the training phase, incorporates multiple pairs of attributes and objects as prior knowledge, leveraging this rich prior knowledge to facilitate the disentanglement of attributes and objects. Building upon this, we select the text encoder from VLMs to construct the Adaptive Fusion Network. We conduct extensive experiments on the Clothing16 K, UT-Zappos50 K, and C-GQA datasets, achieving excellent performance on the Clothing16 K and UT-Zappos50 K datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1193-1204"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521749
Shaocan Liu;Xingtao Wang;Ruiqin Xiong;Xiaopeng Fan
Thanks to its remarkable expressive power for depicting structured data, the Graph Convolutional Network (GCN) has been extensively adopted for skeleton-based action recognition in recent years. However, GCNs are designed to operate on irregular skeleton graphs, making it difficult to deal directly with other modalities represented on regular grids. Thus, although existing works have demonstrated the necessity of multi-modality fusion, few methods in the literature explore the fusion of skeleton and other modalities within a GCN architecture. In this paper, we present a novel GCN-based framework, termed GCN-based Multi-modality Fusion Network (GMFNet), to efficiently utilize complementary information in RGB and skeleton data. GMFNet is constructed by connecting a main stream with a GCN-based multi-modality fusion module (GMFM), whose goal is to gradually combine fine and coarse action-related information extracted from skeletons and RGB videos, respectively. Specifically, a cross-modality data mapping method is designed to transform an RGB video into a skeleton-like (SL) sequence, which is then integrated with the skeleton sequence under a gradual fusion scheme in GMFM. The fusion results are fed into the following main stream to extract more discriminative features and produce the final prediction. In addition, a spatio-temporal joint attention mechanism is introduced for more accurate action recognition. Compared to multi-stream approaches, GMFNet can be implemented within an end-to-end training pipeline, thereby reducing training complexity. Experimental results show the proposed GMFNet achieves impressive performance on the two large-scale datasets NTU RGB+D 60 and 120.
{"title":"GCN-Based Multi-Modality Fusion Network for Action Recognition","authors":"Shaocan Liu;Xingtao Wang;Ruiqin Xiong;Xiaopeng Fan","doi":"10.1109/TMM.2024.3521749","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521749","url":null,"abstract":"Thanks to the remarkably expressive power for depicting structural data, Graph Convolutional Network (GCN) has been extensively adopted for skeleton-based action recognition in recent years. However, GCN is designed to operate on irregular graphs of skeletons, making it difficult to deal with other modalities represented on regular grids directly. Thus, although existing works have demonstrated the necessity of multi-modality fusion, few methods in the literature explore the fusion of skeleton and other modalities within a GCN architecture. In this paper, we present a novel GCN-based framework, termed GCN-based Multi-modality Fusion Network (GMFNet), to efficiently utilize complementary information in RGB and skeleton data. GMFNet is constructed by connecting a main stream with a GCN-based multi-modality fusion module (GMFM), whose goal is to gradually combine finer and coarse action-related information extracted from skeletons and RGB videos, respectively. Specifically, a cross-modality data mapping method is designed to transform an RGB video into a <inline-formula><tex-math>$mathit{skeleton-like}$</tex-math></inline-formula> (SL) sequence, which is then integrated with the skeleton sequence under a gradual fusion scheme in GMFM. The fusion results are fed into the following main stream to extract more discriminative features and produce the final prediction. In addition, a spatio-temporal joint attention mechanism is introduced for more accurate action recognition. Compared to the multi-stream approaches, GMFNet can be implemented within an end-to-end training pipeline and thereby reduces the training complexity. Experimental results show the proposed GMFNet achieves impressive performance on two large-scale data sets of NTU RGB+D 60 and 120.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1242-1253"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521832
Jian Yang;Jun Li;Yunong Cai;Guoming Wu;Zhiping Shi;Chaodong Tan;Xianglong Liu
Adversarial attacks have been extensively studied in the image domain. In recent years, research has shown that video recognition models are also vulnerable to adversarial examples. However, most studies of adversarial attacks on video models have focused on perturbation-based methods, while patch-based black-box attacks have received less attention. Despite the excellent performance of perturbation-based attacks, they are impractical to deploy in the real world. Most existing patch-based black-box attacks require occluding larger areas and issuing more queries to the target model. In this paper, we propose a hard-sample style guided patch attack with reinforcement learning (RL) enhanced motion patterns for video recognition (HSPA). Specifically, we utilize the style features of video hard samples and transfer their multi-dimensional style features to images to obtain a texture patch set. We then use reinforcement learning to locate the patch coordinates and obtain a specific adversarial motion pattern for the patch, enabling an effective attack on a video recognition model in both the spatial and temporal dimensions. Our experiments on three widely used video action recognition models (C3D, LRCN, and TDN) and two mainstream datasets (UCF-101 and HMDB-51) demonstrate the superior performance of our method compared to other state-of-the-art approaches.
{"title":"Hard-Sample Style Guided Patch Attack With RL-Enhanced Motion Pattern for Video Recognition","authors":"Jian Yang;Jun Li;Yunong Cai;Guoming Wu;Zhiping Shi;Chaodong Tan;Xianglong Liu","doi":"10.1109/TMM.2024.3521832","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521832","url":null,"abstract":"Adversarial attacks have been extensively studied in the image field. In recent years, research has shown that video recognition models are also vulnerable to adversarial examples. However, most studies about adversarial attacks for video models have focused on perturbation-based methods, while patch-based black-box attacks have received less attention. Despite the excellent performance of perturbation-based attacks, these attacks are impractical for real-world implementation. Most existing patch-based black-box attacks require occluding larger areas and performing more queries to the target model. In this paper, we propose a hard-sample style guided patch attack with reinforcement learning (RL) enhanced motion patterns for video recognition (HSPA). Specifically, we utilize the style features of video hard samples and transfer their multi-dimensional style features to images to obtain a texture patch set. Then we use reinforcement learning to locate the patch coordinates and obtain a specific adversarial motion pattern of the patch to successfully perform an effective attack on a video recognition model in both the spatial and temporal dimensions. Our experiments on three widely-used video action recognition models (C3D, LRCN, and TDN) and two mainstream datasets (UCF-101 and HMDB-51) demonstrate the superior performance of our method compared to other state-of-the-art approaches.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1205-1215"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521797
Xu Han;Junyu Gao;Chuang Yang;Yuan Yuan;Qi Wang
Due to the diversity of scene text in font, color, shape, and size, accurately and efficiently detecting text remains a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, pixels are predicted in isolation, without pixel-feature interaction, which also limits detection performance. To alleviate these problems, we propose a multi-information-level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts, reducing the influence of noise. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes scale information, enabling the model to distinguish texts of varying scales effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, thereby perceiving environment information. It treats kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling texts of different scales and confirm that the PEM helps perceive pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.
{"title":"Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection","authors":"Xu Han;Junyu Gao;Chuang Yang;Yuan Yuan;Qi Wang","doi":"10.1109/TMM.2024.3521797","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521797","url":null,"abstract":"Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"287-299"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521813
Dong Chen;Kaihang Pan;Guangyu Dai;Guoming Wang;Yueting Zhuang;Siliang Tang;Mingliang Xu
Recent years have seen a surge of interest in anomaly detection. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and a sparse latent space. In contrast, anomaly detectors perform better in the language modality due to the unimodal nature of the data. This paper tackles these challenges for the vision modality from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), comprising Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE), to address the issues of redundant information and sparse latent space, respectively. CMER masks portions of the raw image and computes the matching score with the corresponding text. Essentially, CMER eliminates irrelevant pixels to direct the detector's focus toward critical content. To learn a more compact latent space for vision anomaly detection, CMLE learns a correlation structure matrix from the language modality. The acquired matrix then compels the distribution of images to resemble that of texts in the latent space. Extensive experiments demonstrate the effectiveness of the proposed methods. In particular, compared to the baseline that only utilizes images, CMG improves performance by 16.81%. Ablation experiments further confirm the synergy between CMER and CMLE, as each component depends on the other to achieve optimal performance.
{"title":"Improving Vision Anomaly Detection With the Guidance of Language Modality","authors":"Dong Chen;Kaihang Pan;Guangyu Dai;Guoming Wang;Yueting Zhuang;Siliang Tang;Mingliang Xu","doi":"10.1109/TMM.2024.3521813","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521813","url":null,"abstract":"Recent years have seen a surge of interest in anomaly detection. However, existing unsupervised anomaly detectors, particularly those for the vision modality, face significant challenges due to redundant information and sparse latent space. In contrast, anomaly detectors demonstrate superior performance in the language modality due to the unimodal nature of the data. This paper tackles the aforementioned challenges for vision modality from a multimodal point of view. Specifically, we propose Cross-modal Guidance (CMG), comprising of Cross-modal Entropy Reduction (CMER) and Cross-modal Linear Embedding (CMLE), to address the issues of redundant information and sparse latent space, respectively. CMER involves masking portions of the raw image and computing the matching score with the corresponding text. Essentially, CMER eliminates irrelevant pixels to direct the detector's focus towards critical content. To learn a more compact latent space for the vision anomaly detection, CMLE learns a correlation structure matrix from the language modality. Then, the acquired matrix compels the distribution of images to resemble that of texts in the latent space. Extensive experiments demonstrate the effectiveness of the proposed methods. Particularly, compared to the baseline that only utilizes images, the performance of CMG has been improved by 16.81%. Ablation experiments further confirm the synergy among the proposed CMER and CMLE, as each component depends on the other to achieve optimal performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1410-1419"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521664
Jingtao Sun;Yaonan Wang;Mingtao Feng;Xiaofeng Guo;Huimin Lu;Xieyuanli Chen
Category-level object pose estimation and tracking have achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-dimensional (9D) state tracking from a point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features for estimating the 9D state of each object in the current observation. We then integrate the network estimates into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experimental results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.
{"title":"Category-Level Multi-Object 9D State Tracking Using Object-Centric Multi-Scale Transformer in Point Cloud Stream","authors":"Jingtao Sun;Yaonan Wang;Mingtao Feng;Xiaofeng Guo;Huimin Lu;Xieyuanli Chen","doi":"10.1109/TMM.2024.3521664","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521664","url":null,"abstract":"Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1072-1085"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143465735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521711
Xiating Jin;Jiajun Bu;Zhi Yu;Hui Zhang;Yaonan Wang
Semantic foggy scene understanding (SFSU) emerges as a challenging task under out-of-domain (OD) distributions due to the uncertain cognition caused by degraded visibility. Under the strong assumption of data centralization, unsupervised domain adaptation (UDA) reduces vulnerability in the OD scenario. However, the enlarged domain gap and growing privacy concerns heavily challenge conventional UDA. Motivated by gap decomposition and data decentralization, we establish a decentralized domain adaptation (DDA) framework called Translate thEn Adapt (TEA) for privacy preservation. Our highlights are as follows. (1) For federated hallucination translation, a Disentanglement and Contrastive-learning based Generative Adversarial Network (DisCoGAN) is proposed to impose a contrastive prior and disentangle the latent space in cycle-consistent translation. To yield domain hallucination, each client minimizes the cross-entropy of its local classifier but maximizes the entropy of the global model to train the translator. (2) For source-free regularization adaptation, a Prototypical-knowledge based Regularization Adaptation (ProRA) is presented to align the joint distribution in the output space. Soft adversarial learning relaxes binary labels to rectify inter-domain discrepancy and inner-domain divergence. Structure clustering and entropy minimization drive intra-class features closer and inter-class features apart. Extensive experiments demonstrate the efficacy of our TEA, which achieves 55.26% or 46.25% mIoU in adaptation from GTA5 to Foggy Cityscapes or Foggy Zurich, respectively, outperforming other DDA methods for SFSU.
{"title":"Federated Hallucination Translation and Source-Free Regularization Adaptation in Decentralized Domain Adaptation for Foggy Scene Understanding","authors":"Xiating Jin;Jiajun Bu;Zhi Yu;Hui Zhang;Yaonan Wang","doi":"10.1109/TMM.2024.3521711","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521711","url":null,"abstract":"Semantic foggy scene understanding (SFSU) emerges a challenging task under out-of-domain distribution (OD) due to uncertain cognition caused by degraded visibility. With the strong assumption of data centralization, unsupervised domain adaptation (UDA) reduces vulnerability under OD scenario. Whereas, enlarged domain gap and growing privacy concern heavily challenge conventional UDA. Motivated by gap decomposition and data decentralization, we establish a decentralized domain adaptation (DDA) framework called <bold><u>T</u></b>ranslate th<bold><u>E</u></b>n <bold><u>A</u></b>dapt (abbr. <bold><u>TEA</u></b>) for privacy preservation. Our highlights lie in. (1) Regarding federated hallucination translation, a <bold><u>Dis</u></b>entanglement and <bold><u>Co</u></b>ntrastive-learning based <bold><u>G</u></b>enerative <bold><u>A</u></b>dversarial <bold><u>N</u></b>etwork (abbr. <bold><u>DisCoGAN</u></b>) is proposed to impose contrastive prior and disentangle latent space in cycle-consistent translation. To yield domain hallucination, client minimizes cross-entropy of local classifier but maximizes entropy of global model to train translator. (2) Regarding source-free regularization adaptation, a <bold><u>Pro</u></b>totypical-knowledge based <bold><u>R</u></b>egularization <bold><u>A</u></b>daptation (abbr. <bold><u>ProRA</u></b>) is presented to align joint distribution in output space. Soft adversarial learning relaxes binary label to rectify inter-domain discrepancy and inner-domain divergence. Structure clustering and entropy minimization drive intra-class features closer and inter-class features apart. Extensive experiments exhibit efficacy of our TEA which achieves 55.26% or 46.25% mIoU in adaptation from GTA5 to Foggy Cityscapes or Foggy Zurich, outperforming other DDA methods for SFSU.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1601-1616"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521767
Jingyun Tian;Jinjing Gu;Yuanyuan Pu;Zhengpeng Zhao
Few-shot action recognition aims to identify new action classes with limited training samples. Most existing methods overlook the low information content and diversity of skeleton features and fail to exploit the useful information in rare samples during meta-training, which leads to poor feature discriminability and recognition accuracy. To address both issues, we propose a novel Enriched Skeleton Representation and Multi-relational Metrics (ESR-MM) method for skeleton-based few-shot action recognition. First, a Frobenius Norm Diversity Loss is introduced to enrich the skeleton representation by maximizing the Frobenius norm of the skeleton feature matrix, which mitigates over-smoothing and boosts information content and diversity. Leveraging these enriched features, we propose a multi-relational metrics strategy that exploits cross-sample task-specific information, intra-sample temporal order, and inter-sample distance. Specifically, Support-Adaptive Attention leverages task-specific cues between samples to generate attention-enhanced features. The Bidirectional Temporal Coherent Mean Hausdorff Metric then integrates a Temporal Coherence Measure into the Bidirectional Mean Hausdorff Metric to separate classes while accounting for temporal order. Finally, a Prototype-discriminative Contrastive Loss exploits the distances from class prototypes to query samples. ESR-MM demonstrates superior performance on two benchmarks.
{"title":"Leveraging Enriched Skeleton Representation With Multi-Relational Metrics for Few-Shot Action Recognition","authors":"Jingyun Tian;Jinjing Gu;Yuanyuan Pu;Zhengpeng Zhao","doi":"10.1109/TMM.2024.3521767","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521767","url":null,"abstract":"Few-shot action recognition aims to identify new action classes with limited training samples. Most existing methods overlook the low information content and diversity of skeleton features, failing to exploit useful information in rare samples during meta-training. This leads to poor feature discriminability and recognition accuracy. To address both issues, we propose a novel Enriched Skeleton Representation and Multi-relational Metrics (ESR-MM) method for skeleton-based few-shot action recognition. First, a Frobenius Norm Diversity Loss is introduced to enrich skeleton representation by maximizing the Frobenius norm of the skeleton feature matrix. This mitigates over-smoothing and boosts information content and diversity. Leveraging these enriched features, we propose a multi-relational metrics strategy exploiting cross-sample task-specific information, intra-sample temporal order, and inter-sample distance. Specifically, Support-Adaptive Attention leverages task-specific cues between samples to generate attention-enhanced features. Then, the Bidirectional Temporal Coherent Mean Hausdorff Metric integrates Temporal Coherence Measure into the Bidirectional Mean Hausdorff Metric for class separation by accounting for temporal order. Finally, Prototype-discriminative Contrastive Loss exploits distances from class prototypes to query samples. ESR-MM demonstrates superior performance on two benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1228-1241"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143594332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-25 | DOI: 10.1109/TMM.2024.3521755
Mehwish Ghafoor;Arif Mahmood;Muhammad Bilal
In 3D human pose estimation from monocular videos, the presence of diverse occlusion types presents a formidable challenge. Prior research has made progress by harnessing spatial and temporal cues to infer 3D poses from 2D joint observations. This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach to holistic 3D pose estimation even in the presence of severe occlusions. To confront occlusion-induced missing joint data, we propose a temporal interpolation-based occlusion guidance mechanism. To enable precise 3D human pose estimation, our approach leverages the DTF architecture, which first generates a pair of intermediate views. Each intermediate view undergoes spatial refinement through a self-refinement schema. These intermediate views are then fused to yield the final 3D human pose estimate. The entire system is end-to-end trainable. Through extensive experiments conducted on the Human3.6M and MPI-INF-3DHP datasets, our method's performance is rigorously evaluated. Notably, our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
{"title":"Enhancing 3D Human Pose Estimation Amidst Severe Occlusion With Dual Transformer Fusion","authors":"Mehwish Ghafoor;Arif Mahmood;Muhammad Bilal","doi":"10.1109/TMM.2024.3521755","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521755","url":null,"abstract":"In the field of 3D Human Pose Estimation from monocular videos, the presence of diverse occlusion types presents a formidable challenge. Prior research has made progress by harnessing spatial and temporal cues to infer 3D poses from 2D joint observations. This paper introduces a Dual Transformer Fusion (DTF) algorithm, a novel approach to obtain a holistic 3D pose estimation, even in the presence of severe occlusions. Confronting the issue of occlusion-induced missing joint data, we propose a temporal interpolation-based occlusion guidance mechanism. To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views. Each intermediate-view undergoes spatial refinement through a self-refinement schema. Subsequently, these intermediate-views are fused to yield the final 3D human pose estimation. The entire system is end-to-end trainable. Through extensive experiments conducted on the Human3.6 M and MPI-INF-3DHP datasets, our method's performance is rigorously evaluated. Notably, our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1617-1624"},"PeriodicalIF":8.4,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}