IEEE Transactions on Multimedia: Latest Articles (Page 9)

MDSC-Net: Multi-Modal Discriminative Sparse Coding Driven RGB-D Classification Network

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521720

Jingyi Xu;Xin Deng;Yibing Fu;Mai Xu;Shengxi Li

In this paper, we propose a novel sparsity-driven deep neural network for RGB-D image classification. Unlike existing classification networks, our architecture draws inspiration from a newly proposed multi-modal discriminative sparse coding (MDSC) model. The key feature of this model is that it gradually separates the discriminative and non-discriminative features in RGB-D images in a coarse-to-fine manner. Only the discriminative features are integrated and refined for classification, while the non-discriminative features are discarded, which improves classification accuracy and efficiency. Derived from the MDSC model, the proposed network is composed of three modules: the shared feature extraction (SFE) module, the discriminative feature refinement (DFR) module, and the classification module. The architecture of each module is derived from the optimization solution of the MDSC model. To the best of our knowledge, this is the first fully sparsity-driven network proposed for RGB-D image classification. Extensive results on different RGB-D image datasets verify the effectiveness of our method.
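
The abstract says each module is derived from the optimization solution of the MDSC model but does not state that solution. As general background only, the sketch below shows classical single-modality sparse coding solved with iterative soft-thresholding (ISTA), the kind of iteration that sparsity-driven networks commonly unroll into layers. The dictionary `D`, the step size, and the function names are hypothetical and are not taken from the paper.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * |v|: shrink v toward zero by lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def ista(D, x, lam=0.1, step=0.5, n_iter=100):
    """Minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 over z with ISTA.

    D is a list of rows (m x k) and x has length m. `step` should not
    exceed 1 / L, where L is the largest eigenvalue of D^T D.
    """
    m, k = len(D), len(D[0])
    z = [0.0] * k
    for _ in range(n_iter):
        # Residual r = D z - x
        r = [sum(D[i][j] * z[j] for j in range(k)) - x[i] for i in range(m)]
        # Gradient g = D^T r
        g = [sum(D[i][j] * r[i] for i in range(m)) for j in range(k)]
        # Gradient step followed by shrinkage; small entries become exactly 0
        z = [soft_threshold(z[j] - step * g[j], step * lam) for j in range(k)]
    return z
```

With an identity dictionary the solution is simply the soft-thresholded input, so a weak coefficient is set exactly to zero while a strong one is kept, which is the sparsity behaviour the abstract relies on to discard non-discriminative features.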

IEEE Transactions on Multimedia, vol. 27, pp. 442-454. https://doi.org/10.1109/TMM.2024.3521720

Citations: 0

Image-Based Freeform Handwriting Authentication With Energy-Oriented Self-Supervised Learning

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521807

Jingyao Wang;Luntian Mou;Changwen Zheng;Wen Gao

Freeform handwriting authentication verifies a person's identity from their writing style and habits in messy handwriting data. This technique has gained widespread attention in recent years as a valuable tool in fields such as fraud prevention and cultural heritage protection. However, it remains a challenging task in practice for three reasons: (i) severe damage, (ii) complex high-dimensional features, and (iii) lack of supervision. To address these issues, we propose SherlockNet, an energy-oriented two-branch contrastive self-supervised learning framework for robust and fast freeform handwriting authentication. It consists of four stages: (i) pre-processing: converting manuscripts into energy distributions using a novel plug-and-play energy-oriented operator to eliminate the influence of noise; (ii) generalized pre-training: learning general representations through two-branch momentum-based adaptive contrastive learning on the energy distributions, which handles the high-dimensional features and spatial dependencies of handwriting; (iii) personalized fine-tuning: calibrating the learned knowledge using a small amount of labeled data from downstream tasks; and (iv) practical application: identifying individual handwriting from scrambled, missing, or forged data efficiently and conveniently. For practical evaluation, we construct EN-HA, a novel dataset that simulates data forgery and severe damage in real applications. Finally, we conduct extensive experiments on six benchmark datasets, including our EN-HA, and the results demonstrate the robustness and efficiency of SherlockNet.
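
Stage (ii) mentions two-branch momentum-based contrastive learning without giving details. The sketch below illustrates the two generic ingredients such frameworks usually share: an exponential-moving-average update of the second branch and an InfoNCE loss. It is an assumption-laden illustration, not SherlockNet's adaptive variant; the function names, the momentum `m`, and the temperature are hypothetical.

```python
import math


def momentum_update(key_params, query_params, m=0.99):
    """EMA update of the momentum branch: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def info_nce(query, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE) loss for a single query vector.

    Vectors are assumed to be L2-normalised, so the dot product is the
    cosine similarity. The loss is low when the query is closer to its
    positive than to any negative.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    # Negative log-probability of picking the positive
    return log_sum - logits[0]
```

In a typical setup the query branch is trained by gradient descent on the loss, while the momentum branch only follows it through `momentum_update`, which keeps the targets slowly varying.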

IEEE Transactions on Multimedia, vol. 27, pp. 1397-1409. https://doi.org/10.1109/TMM.2024.3521807

Citations: 0

Fast Disentangled Slim Tensor Learning for Multi-View Clustering

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521754

Deng Xu;Chao Zhang;Zechao Li;Chunlin Chen;Huaxiong Li

Tensor-based multi-view clustering has recently received significant attention due to its exceptional ability to explore cross-view high-order correlations. However, most existing methods still have limitations. (1) Most of them explore the correlations among different affinity matrices, which makes them unscalable to large-scale data. (2) Although some methods address this by introducing bipartite graphs, they may yield sub-optimal solutions because the anchor selection process is unstable. (3) They generally ignore the negative impact of latent semantic-unrelated information in each view. To tackle these issues, we propose a new approach termed fast Disentangled Slim Tensor Learning (DSTL) for multi-view clustering. Instead of focusing on the multi-view graph structures, DSTL directly explores the high-order correlations among multi-view latent semantic representations based on matrix factorization. To alleviate the negative influence of feature redundancy, inspired by robust PCA, DSTL disentangles the latent low-dimensional representation into a semantic-unrelated part and a semantic-related part for each view. Subsequently, two slim tensors are constructed with tensor-based regularization. To further enhance the quality of feature disentanglement, the semantic-related representations are aligned across views through a consensus alignment indicator. Our proposed model is computationally efficient and can be solved effectively. Extensive experiments demonstrate the superiority and efficiency of DSTL over state-of-the-art approaches.
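
For context, the robust PCA that the abstract cites as inspiration splits a data matrix into a low-rank part and a sparse part. The standard formulation is reproduced below as background; the symbols M, L, S and \lambda are the usual textbook notation rather than the paper's, and the DSTL objective itself is not stated in the abstract.

```latex
\min_{L,\,S}\ \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad M = L + S
```

Here \|L\|_{*} is the nuclear norm (the sum of singular values of L) and \|S\|_{1} is the element-wise \ell_1 norm. The abstract describes an analogous split of each view's latent representation into a semantic-related part and a semantic-unrelated part.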

IEEE Transactions on Multimedia, vol. 27, pp. 1254-1265. https://doi.org/10.1109/TMM.2024.3521754

Citations: 0

Weakly Supervised Referring Video Object Segmentation With Object-Centric Pseudo-Guidance

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521741

Weikang Wang;Yuting Su;Jing Liu;Wei Sun;Guangtao Zhai

Referring video object segmentation (RVOS) is an emerging task for multimodal video comprehension, but the expensive annotation of object masks restricts the scalability and diversity of RVOS datasets. To relax the dependency on expensive mask annotations and take advantage of large-scale partially annotated data, we explore a novel extended RVOS task, namely weakly supervised referring video object segmentation (WRVOS), which employs multiple weak supervision sources, including object points and bounding boxes. Correspondingly, we propose a unified WRVOS framework. Specifically, an object-centric pseudo mask generation method is introduced to provide effective shape priors for the pseudo guidance of spatial object location. Then, a pseudo-guided optimization strategy is proposed to optimize the object outlines in terms of spatial location and projection density with a multi-stage online learning strategy. Furthermore, a multimodal cross-frame level set evolution method is proposed to iteratively refine the object boundaries, considering both temporal consistency and cross-modal interactions. Extensive experiments are conducted on four publicly available RVOS datasets: A2D Sentences, J-HMDB Sentences, Ref-DAVIS, and Ref-YoutubeVOS. Performance comparisons show that the proposed method achieves state-of-the-art performance in both point-supervised and box-supervised settings.
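
The abstract does not define how box supervision constrains the predicted mask. One common idea in box-supervised segmentation, shown here purely as an illustrative assumption, is to compare the axis projections of a predicted mask with those of the annotated box. The helper names and the inclusive `(x0, y0, x1, y1)` box convention are hypothetical and are not the paper's formulation.

```python
def project(mask):
    """Project a binary mask onto the y-axis (rows) and x-axis (columns) using max."""
    rows = [max(r) for r in mask]
    cols = [max(c) for c in zip(*mask)]
    return rows, cols


def box_to_mask(box, height, width):
    """Rasterise a box (x0, y0, x1, y1), inclusive pixel coordinates, to a binary mask."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(width)]
        for y in range(height)
    ]


def projection_mismatch(pred_mask, box):
    """Count row/column positions where mask and box projections disagree."""
    height, width = len(pred_mask), len(pred_mask[0])
    pred_rows, pred_cols = project(pred_mask)
    box_rows, box_cols = project(box_to_mask(box, height, width))
    return (
        sum(a != b for a, b in zip(pred_rows, box_rows))
        + sum(a != b for a, b in zip(pred_cols, box_cols))
    )
```

A mask that touches all four sides of the box without leaving it scores zero, whatever its shape inside the box, which is why a box alone is a weak signal and additional shape priors are needed.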

IEEE Transactions on Multimedia, vol. 27, pp. 1320-1333. https://doi.org/10.1109/TMM.2024.3521741

Citations: 0

Dynamic Strategy Prompt Reasoning for Emotional Support Conversation

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521669

Yiting Liu;Liang Li;Yunbin Tu;Beichen Zhang;Zheng-Jun Zha;Qingming Huang

An emotional support conversation (ESC) system aims to reduce users' emotional distress by engaging in conversation using various reply strategies as guidance. To develop instructive reply strategies for an ESC system, it is essential to consider the dynamic transitions of users' emotional states across conversational turns. However, existing methods for strategy-guided ESC systems struggle to capture these transitions because they overlook the inference of fine-grained user intentions. This oversight impedes the model's ability to derive pertinent strategy information and, consequently, hinders its capacity to generate emotionally supportive responses. To tackle this limitation, we propose a novel dynamic strategy prompt reasoning model (DSR), which leverages sparse context relation deduction to acquire adaptive representations of reply strategies as prompts for guiding the response generation process. Specifically, we first perform turn-level commonsense reasoning with different approaches to extract auxiliary knowledge, which enhances the comprehension of user intention. Then we design a context relation deduction module to dynamically integrate interdependent dialogue information, capturing granular user intentions and generating effective strategy prompts. Finally, we utilize the strategy prompts to guide the generation of more relevant and supportive responses. The DSR model is validated through extensive experiments on a benchmark dataset, demonstrating superior performance compared to the latest competitive methods in the field.

IEEE Transactions on Multimedia, vol. 27, pp. 108-119. https://doi.org/10.1109/TMM.2024.3521669

Citations: 0

Cross-Modal Cognitive Consensus Guided Audio–Visual Segmentation

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521746

Zhaofeng Shi;Qingbo Wu;Fanman Meng;Linfeng Xu;Hongliang Li

Audio-Visual Segmentation (AVS) aims to extract the sounding object from a video frame, represented by a pixel-wise segmentation mask, for application scenarios such as multi-modal video editing, augmented reality, and intelligent robot systems. The pioneering work conducts this task through dense feature-level audio-visual interaction, which ignores the dimension gap between different modalities. More specifically, the audio clip can only provide a global semantic label in each sequence, whereas the video frame covers multiple semantic objects across different local regions, which leads to mislocalization of objects that are representationally similar but semantically different. In this paper, we propose a Cross-modal Cognitive Consensus guided Network (C3N) to align the audio-visual semantics from the global dimension and progressively inject them into the local regions via an attention mechanism. First, a Cross-modal Cognitive Consensus Inference Module (C3IM) is developed to extract a unified-modal label by integrating audio/visual classification confidence and similarities of modality-agnostic label embeddings. Then, we feed the unified-modal label back to the visual backbone as explicit semantic-level guidance via a Cognitive Consensus guided Attention Module (CCAM), which highlights the local features corresponding to the object of interest. Extensive experiments on the Single Sound Source Segmentation (S4) setting and Multiple Sound Source Segmentation (MS3) setting of the AVSBench dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
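
The abstract says C3IM combines audio and visual classification confidence with label-embedding similarity, but gives no formula. The sketch below is one simple way such a consensus could be scored, offered as an assumption rather than the paper's rule: multiply the two confidences by the cosine similarity of the label embeddings and keep the best pair. All names, labels, and the product scoring rule are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consensus_label(audio_conf, visual_conf, audio_emb, visual_emb):
    """Return (score, audio_label, visual_label) for the best-agreeing pair.

    audio_conf / visual_conf map labels to classifier confidences;
    audio_emb / visual_emb map the same labels to embedding vectors
    that live in a shared, modality-agnostic space.
    """
    best = None
    for a_label, a_p in audio_conf.items():
        for v_label, v_p in visual_conf.items():
            score = a_p * v_p * cosine(audio_emb[a_label], visual_emb[v_label])
            if best is None or score > best[0]:
                best = (score, a_label, v_label)
    return best
```

Under this toy rule, a frame containing both a dog and a car resolves to the dog when the audio classifier is confident about barking, because the mismatched pairs are suppressed by their low embedding similarity.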

IEEE Transactions on Multimedia, vol. 27, pp. 209-223. https://doi.org/10.1109/TMM.2024.3521746

Citations: 0

The Role of ViT Design and Training in Robustness to Common Corruptions

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521721

Rui Tian;Zuxuan Wu;Qi Dai;Micah Goldblum;Han Hu;Yu-Gang Jiang

Vision transformer (ViT) variants have made rapid advances on a variety of computer vision tasks. However, their performance on corrupted inputs, which are inevitable in realistic use cases due to variations in lighting and weather, has not been explored comprehensively. In this paper, we probe the robustness gap among ViT variants and ask how these modern architectural developments affect performance under common types of corruption. Through extensive and rigorous benchmarking, we demonstrate that simple architectural designs such as overlapping patch embedding and convolutional feed-forward networks can promote the robustness of ViTs. Moreover, since the de facto training of ViTs relies heavily on data augmentation, exactly which augmentation strategies make ViTs more robust is worth investigating. We survey the efficacy of previous methods and verify that adversarial noise training is powerful. In addition, we introduce a novel conditional method for generating dynamic augmentation parameters conditioned on input images, which offers state-of-the-art robustness to common corruptions.
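
To make "overlapping patch embedding" concrete: a patch embedding is a strided convolution, and patches overlap whenever the kernel is larger than the stride. The sketch below computes the resulting token grid with the standard convolution output-size formula. The kernel, stride, and padding values used in the usage note are illustrative assumptions, not settings reported in the abstract.

```python
def patch_grid(size, kernel, stride, padding=0):
    """Number of patches along one image side for a strided-convolution embedding."""
    return (size + 2 * padding - kernel) // stride + 1


def is_overlapping(kernel, stride):
    """Neighbouring patches share pixels exactly when the kernel exceeds the stride."""
    return kernel > stride
```

For a 224-pixel side, `patch_grid(224, 16, 16)` gives 14 non-overlapping patches per side, while `patch_grid(224, 7, 4, padding=3)` gives 56 overlapping ones, so each pixel near a patch boundary contributes to more than one token.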

Citations: 0

Polarization State Attention Dehazing Network With a Simulated Polar-Haze Dataset

IF 8.4; Zone 1 (Computer Science); Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521827

Sijia Wen;Yinqiang Zheng;Feng Lu

Image dehazing under harsh weather conditions remains a challenging and ill-posed problem. In addition, acquiring real-time haze-free counterparts of hazy images is difficult. Existing approaches commonly synthesize hazy data by relying on estimated depth information, which is prone to errors due to its physical unreliability. While generative networks can transfer some hazy features to clear images, the resulting hazy images still exhibit an artificial appearance. In this paper, we use polarization cues to propose a haze simulation strategy for synthesizing hazy data, ensuring visually pleasing results that adhere to physical laws. Leveraging the simulated Polar-Haze dataset, we present a polarization state attention dehazing network (PSADNet), which consists of a polarization extraction module and a polarization dehazing module. The polarization extraction module incorporates an attention mechanism to capture high-level image features related to polarization and chromaticity. The polarization dehazing module uses these features, derived from the polarization analysis, to enhance image dehazing while preserving the accuracy of the polarization information. Promising results are observed in both qualitative and quantitative experiments, supporting the effectiveness of the proposed PSADNet and the validity of the polarization-based haze simulation strategy.
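
The abstract does not spell out the simulation strategy, so the sketch below only illustrates two textbook relations that polarization-based dehazing work typically builds on: the linear Stokes parameters computed from intensities behind a polarizer at 0, 45, 90 and 135 degrees, and the atmospheric scattering model for a hazy pixel. Function and parameter names are hypothetical, and nothing here is claimed to be the paper's pipeline.

```python
import math


def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs vertical
    s2 = i45 - i135                     # +45 vs -45 degrees
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree and angle (radians) of linear polarization."""
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


def add_haze(radiance, airlight, beta, depth):
    """Atmospheric scattering model: I = J * t + A * (1 - t), t = exp(-beta * depth)."""
    transmission = math.exp(-beta * depth)
    return radiance * transmission + airlight * (1.0 - transmission)
```

Fully horizontally polarized light yields a degree of linear polarization of 1 and unpolarized light yields 0, while `add_haze` drives a pixel toward the airlight value as depth or the scattering coefficient `beta` grows.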

{"title":"Polarization State Attention Dehazing Network With a Simulated Polar-Haze Dataset","authors":"Sijia Wen;Yinqiang Zheng;Feng Lu","doi":"10.1109/TMM.2024.3521827","DOIUrl":"https://doi.org/10.1109/TMM.2024.3521827","url":null,"abstract":"Image dehazing under harsh weather conditions remains a challenging and ill-posed problem. In addition, acquiring real-time haze-free counterparts of hazy images poses difficulties. Existing approaches commonly synthesize hazy data by relying on estimated depth information, which is prone to errors due to its physical unreliability. While generative networks can transfer some hazy features to clear images, the resulting hazy images still exhibit an artificial appearance. In this paper, we introduce polarization cues to propose a haze simulation strategy to synthesize hazy data, ensuring visually pleasing results that adhere to physical laws. Leveraging on the simulated Polar-Haze dataset, we present a polarization state attention dehazing network (PSADNet), which consists of a polarization extraction module and a polarization dehazing module. The proposed polarization extraction model incorporates an attention mechanism to capture high-level image features related to polarization and chromaticity. The polarization dehazing module utilizes these features derived from the polarization analysis to enhance image dehazing capabilities while preserving the accuracy of the polarization information. 
Promising results are observed in both qualitative and quantitative experiments, supporting the effectiveness of the proposed PSADNet and the validity of polarization-based haze simulation strategy.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"263-274"},"PeriodicalIF":8.4,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Citations: 0

SDE2D: Semantic-Guided Discriminability Enhancement Feature Detector and Descriptor

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521748

Jiapeng Li;Ruonan Zhang;Ge Li;Thomas H. Li

Local feature detectors and descriptors serve various computer vision tasks, such as image matching, visual localization, and 3D reconstruction. To cope with the extreme variations in rotation and illumination found in the real world, most detectors and descriptors try to capture as much invariance as possible. However, these methods ignore feature discriminability and perform poorly in indoor scenes. Indoor scenes contain many weakly textured and even repetitively textured regions, so the extracted features must be sufficiently discriminative. Therefore, we propose a semantic-guided method (called SDE2D) that enhances feature discriminability to improve descriptor performance in indoor scenes. We develop a semantic-guided discriminability enhancement (SDE) loss function that uses semantic information from indoor scenes. To the best of our knowledge, this is the first in-depth study that applies semantic segmentation to enhance discriminability. In addition, we design a novel framework in which a semantic segmentation network is embedded as a module and provides guidance information for training. We also explore the impact of different semantic segmentation models on our method. Experimental results on indoor scene datasets demonstrate that the proposed SDE2D performs well compared with state-of-the-art models.
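
The abstract does not give the form of the SDE loss, so the following is only a minimal sketch of one way semantic labels could enter a descriptor loss, not the paper's formulation. It is a plain triplet-margin loss in which non-matching descriptors that share the anchor's semantic class (for example, two different patches of the same wall) are penalized more heavily. The function name, the `same_class_weight` parameter, and the weighting rule are all assumptions made for illustration.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two descriptors given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semantic_weighted_triplet_loss(anchor, positive, negatives,
                                   anchor_label, negative_labels,
                                   margin=1.0, same_class_weight=2.0):
    """Triplet-margin loss with a hypothetical semantic weighting.

    Each negative contributes a hinge term max(0, margin + d_pos - d_neg).
    Negatives whose semantic label equals the anchor's label are multiplied
    by `same_class_weight` (an assumed rule, not taken from the paper).
    """
    d_pos = l2_distance(anchor, positive)
    total = 0.0
    for neg, label in zip(negatives, negative_labels):
        weight = same_class_weight if label == anchor_label else 1.0
        total += weight * max(0.0, margin + d_pos - l2_distance(anchor, neg))
    return total / len(negatives)


# Usage: one confusable negative from the same class, one distant negative.
loss = semantic_weighted_triplet_loss(
    anchor=[0.0, 0.0], positive=[0.0, 0.0],
    negatives=[[0.5, 0.0], [3.0, 0.0]],
    anchor_label="wall", negative_labels=["wall", "chair"],
)
```

In this toy call only the nearby "wall" negative violates the margin, and its hinge term is doubled because it shares the anchor's class. Setting `same_class_weight=1.0` recovers the ordinary triplet-margin loss.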


Volume 27, Pages 275-286

Citations: 0

WHANet: Wavelet-Based Hybrid Asymmetric Network for Spectral Super-Resolution From RGB Inputs

IF 8.4 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia

Pub Date : 2024-12-23 DOI: 10.1109/TMM.2024.3521713

Nan Wang;Shaohui Mei;Yi Wang;Yifan Zhang;Duo Zhan

The reconstruction of dozens of spectral bands from three, known as spectral super-resolution (SSR), has achieved remarkable progress with the continuous development of deep learning. However, the reconstructed hyperspectral images (HSIs) still suffer from spatial degradation due to insufficient retention of high-frequency (HF) information during the SSR process. To remedy this issue, a novel Wavelet-based Hybrid Asymmetric Network (WHANet) is proposed to establish an RGB-to-HSI translation in the wavelet domain, thus preserving and emphasizing the HF features in hyperspectral space. The backbone is designed as a hybrid asymmetric structure that learns the exact representations of the decomposed wavelet coefficients in the hyperspectral domain in parallel. A CNN-based HF reconstruction module (HFRM) and a transformer-based low-frequency (LF) reconstruction module (LFRM) are devised to perform the SSR process separately, so that each type of wavelet coefficient is processed by a module suited to it. Furthermore, a hybrid loss function incorporating the Fast Fourier loss (FFL) is proposed to directly regularize and emphasize the missing HF components. Experimental results on three benchmark datasets and one remote sensing dataset demonstrate that WHANet reaches state-of-the-art performance both quantitatively and qualitatively.
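
To make the idea of working in the wavelet domain concrete, here is a minimal sketch of a one-level 2D wavelet decomposition, which splits an image into one low-frequency band and three high-frequency detail bands that could then be handled by separate branches. The abstract does not name a wavelet family. The Haar wavelet is used here purely as the simplest choice, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical, not taken from the paper.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of a list-of-lists image.

    Height and width must be even. Returns (ll, lh, hl, hh), each half the
    input size: ll holds scaled local averages (low frequency), the other
    three hold local differences (high-frequency detail).
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        rows = ([], [], [], [])
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 2)  # average of the 2x2 block
            rows[1].append((a - b + c - d) / 2)  # left column minus right column
            rows[2].append((a + b - c - d) / 2)  # top row minus bottom row
            rows[3].append((a - b - c + d) / 2)  # diagonal difference
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-size image from the four bands."""
    out = []
    for r in range(len(ll)):
        top, bottom = [], []
        for k in range(len(ll[0])):
            s, x, y, z = ll[r][k], lh[r][k], hl[r][k], hh[r][k]
            top += [(s + x + y + z) / 2, (s - x + y - z) / 2]
            bottom += [(s + x - y - z) / 2, (s - x - y + z) / 2]
        out += [top, bottom]
    return out


# Usage: decompose a small test image and reconstruct it.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, lh, hl, hh = haar_dwt2(image)
restored = haar_idwt2(ll, lh, hl, hh)
```

Because the transform is invertible, no information is lost by the split. Edges and fine texture end up in the three detail bands, which is why a network that treats those bands explicitly has a direct handle on the high-frequency content the abstract says is otherwise under-retained.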


Volume 27, Pages 414-428

Citations: 0