
Latest Publications in IEEE Transactions on Multimedia

Purified Zero-Shot Sketch-Based Image Retrieval
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632682
Yang Zhou;Jingru Yang;Jin Wang;Kaixiang Huang;Guodong Lu;Shengfeng He
Sketches, an emerging alternative to natural language in multimedia systems, are characterized by sparse visual cues such as simple strokes, which differ significantly from natural images containing complex elements such as background, foreground, and texture. This misalignment poses substantial challenges for zero-shot sketch-based image retrieval (ZS-SBIR). Prior approaches match sketches to full images and tend to overlook the redundant elements in natural images, leading to model distraction and semantic ambiguity. To address this issue, we introduce a distraction-agnostic framework, purified cross-domain matching (PuXIM), which operates on a straightforward principle: masking and matching. We devise a visual-cross-linguistic (VxL) sampler that generates linguistic masks based on semantic labels to obscure semantically irrelevant image features. Our novel contribution is the concept of purified masked matching (PMM), which comprises two processes: (1) reconstruction, which compels the image encoder to reconstruct the masked image feature, and (2) interaction, which involves a transformer decoder that processes both sketch and masked image features to investigate cross-domain relationships for effective matching. Evaluated on the TU-Berlin, Sketchy, and QuickDraw datasets, PuXIM sets new performance benchmarks. Importantly, the distraction-agnostic nature of the matching process makes PuXIM easier to train, enabling efficient adaptation to zero-shot scenarios even with less, and lower-quality, training data.
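The masking-and-matching principle can be illustrated with a toy sketch in NumPy. This is not the authors' implementation: the linguistic mask is approximated here by a hypothetical binary relevance vector, and features are plain vectors rather than encoder outputs. The point is only that masking clutter channels before a cosine comparison removes the distraction a full-image match suffers from.

```python
import numpy as np

def masked_match_score(sketch_feat, image_feat, relevance_mask):
    """Cosine similarity between a sketch feature and an image feature
    whose semantically irrelevant channels have been masked out."""
    purified = image_feat * relevance_mask  # suppress distracting channels
    num = float(np.dot(sketch_feat, purified))
    den = float(np.linalg.norm(sketch_feat) * np.linalg.norm(purified)) + 1e-8
    return num / den

# Toy features: the sketch carries only object strokes (channels 0..3);
# the image feature adds background/texture clutter in channels 4..7.
sketch = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
image = np.array([1.0, 1.0, 1.0, 1.0, 5.0, -3.0, 2.0, 4.0])
mask = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # label-derived relevance (hypothetical)

score_masked = masked_match_score(sketch, image, mask)      # purified match
score_full = masked_match_score(sketch, image, np.ones(8))  # naive full-image match
```

With the clutter masked out, the purified match is near-perfect, while the full-image match is dragged down by channels the sketch cannot express.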
IEEE Transactions on Multimedia, vol. 28, pp. 929–943, 2025.
Citations: 0
Real-Scene Image Dehazing via Laplacian Pyramid-Based Conditional Diffusion Model
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632694
Yongzhen Wang;Jie Sun;Heng Liu;Xiao-Ping Zhang;Mingqiang Wei
Recent diffusion models have demonstrated exceptional efficacy across various image restoration tasks, but still suffer from slow inference and substantial computational cost. To address these challenges, we present LPCDiff, a novel Laplacian Pyramid-based Conditional Diffusion model designed for real-scene image dehazing. LPCDiff leverages the Laplacian pyramid decomposition to decouple the input image into two components: the low-resolution low-pass image and the high-frequency residuals. These components are subsequently reconstructed through a diffusion model and a well-designed high-frequency residual recovery module. With such a strategy, LPCDiff can substantially accelerate inference speed and reduce computational costs without sacrificing image fidelity. In addition, the framework empowers the model to capture intrinsic high-frequency details and low-frequency structural information within the image, resulting in sharper and more realistic haze-free outputs. Moreover, to extract more valuable information from the limited training data, we introduce a low-frequency refinement module to further enhance the intricate details of the final dehazed images. Through extensive experimentation, our method significantly outperforms 12 state-of-the-art approaches on three real-world and one synthetic image dehazing benchmarks.
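The Laplacian pyramid decomposition at the core of this design can be sketched as follows. Note this is a simplified stand-in, not the paper's code: classical Laplacian pyramids use Gaussian filtering for the down/up steps, whereas this sketch substitutes 2×2 average pooling and nearest-neighbour upsampling so the round trip is exactly invertible. The key property is the same: a cheap low-resolution base (where the diffusion model would operate) plus high-frequency residuals that restore the full-resolution image losslessly.

```python
import numpy as np

def downsample(img):
    """2x2 average pooling (stand-in for the low-pass pyramid step)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour 2x upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_decompose(img, levels=2):
    """Split an image into a coarse low-pass base plus per-level
    high-frequency residuals, as in a Laplacian pyramid."""
    residuals = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        residuals.append(current - upsample(low))  # what the low-pass step lost
        current = low
    return current, residuals

def laplacian_reconstruct(base, residuals):
    """Invert the decomposition: upsample and add residuals back, fine-to-coarse."""
    current = base
    for res in reversed(residuals):
        current = upsample(current) + res
    return current

img = np.arange(64, dtype=float).reshape(8, 8)
base, residuals = laplacian_decompose(img, levels=2)  # base is 2x2
recon = laplacian_reconstruct(base, residuals)
```

Because the residuals record exactly what each downsampling step discarded, reconstruction is exact; in LPCDiff only the small base goes through the expensive diffusion process, which is where the speedup comes from.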
IEEE Transactions on Multimedia, vol. 28, pp. 944–957, 2025.
Citations: 0
Generic-to-Personalised Learning for Multimodal Image Synthesis With Bidirectional Variational GAN
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632663
Long Chen;Xirui Dong;Jiangrong Shen;Lu Zhang;Qi Xu;Gang Pan;Qiang Zhang
Multimodal image synthesis, which predicts target-modality images from source-modality images, has garnered considerable attention in the field of clinical diagnosis. Both unidirectional and bidirectional multimodal image synthesis methods have been explored in the medical domain; however, unidirectional models heavily rely on paired images, while current bidirectional models typically overlook local image details due to their unsupervised training patterns. In this work, we propose a Bidirectional Variational Generative Adversarial Network (BVGAN) for multimodal image synthesis, which achieves high-quality bidirectional translations between any two modalities using only a limited number of paired images. Firstly, BVGAN’s generator incorporates a variational structure (VAS) to regularise the latent space for noise reduction. This regularisation imposes smoothness on the latent space, enabling BVGAN to produce high-quality, noise-free images. Secondly, a novel generic-to-personalised (GTP) learning strategy is introduced to train BVGAN and reduce its reliance on large sets of paired images. GTP initially leverages an unsupervised learning model to capture the global mapping between two modalities using unpaired images from generic patients. It then applies a supervised learning model to refine the mapping for individual patients, enhancing image details. Finally, the GTP learning strategy along with VAS enables BVGAN to achieve state-of-the-art performance on two multi-modality medical datasets: Brain CTMRI and BRATS.
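The variational structure (VAS) the abstract mentions follows the standard VAE recipe: a reparameterised sample from the latent distribution plus a KL penalty that pulls it toward a standard normal, which is what "imposes smoothness" on the latent space. A minimal NumPy sketch of those two ingredients (illustrative only, not the BVGAN generator):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps; in a real autograd framework this
    keeps the sampling step differentiable w.r.t. mu and log_var."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ): the smoothness penalty on the latent space."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

rng = np.random.default_rng(0)
mu = np.zeros(4)
log_var = np.zeros(4)                  # sigma = 1: already a standard normal
z = reparameterize(mu, log_var, rng)

kl_zero = kl_to_standard_normal(mu, log_var)          # matched distributions
kl_shifted = kl_to_standard_normal(mu + 1.0, log_var) # mean pushed away from 0
```

The penalty is zero exactly when the latent matches N(0, I) and grows as the encoder's posterior drifts, discouraging the scattered, noisy latent codes an unregularised GAN generator can produce.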
IEEE Transactions on Multimedia, vol. 28, pp. 902–914, 2025.
Citations: 0
Progressive Learning of Instance-Level Proxy Semantics for Few-Shot Action Recognition
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632652
Fang Peng;Xiaoshan Yang;Yaowei Wang;Changsheng Xu
Few-shot action recognition is a crucial task for mitigating the challenges of data scarcity in video understanding. Recent advancements in large-scale pre-trained models have introduced the potential of incorporating semantic knowledge from multi-modal pre-trained models, such as CLIP, to alleviate these challenges. Although some progress has been made, existing methods still rely on class-level text embeddings that are inherently low in diversity, limiting their ability to generalize to unseen actions. To overcome this limitation, we propose a novel framework called Progressive Learning of Instance-Level Proxy Semantics (ProLIPS). ProLIPS integrates Proxy Semantic Diffusion (PSD) to generate rich, instance-level proxy semantic features with diverse semantic contents and temporal dynamics, utilizing a multi-step CLIP-guidance mechanism and a time-conditioned reverse diffusion process. Our approach preserves the diversity of semantic-aligned visual features, significantly improving the generalization and robustness of few-shot action recognition. Extensive experiments on five challenging benchmarks demonstrate the effectiveness of ProLIPS.
IEEE Transactions on Multimedia, vol. 28, pp. 853–864, 2025.
Citations: 0
Cas-OVD: Cascaded Open-Vocabulary Detection of Small Objects Using Multi-Refined Region Proposal Network in Autonomous Driving
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632649
Zhenyu Fang;Yulong Wu;Jinchang Ren;Jiangbin Zheng;Yijun Yan;Lixiang Zhang
Although text information has aided existing models to achieve promising results in open vocabulary object detection (OVD), the lack of semantic information makes small object detection (SOD) difficult. Moreover, this semantic gap also causes failures when matching text to image features, resulting in false negatives. To address these issues, we propose a Cascade Open Vocabulary Detector (Cas-OVD), which builds upon existing multi-stage detection pipelines but specializes in text-vision alignment for small objects. In particular, we adapt a multi-refined region proposal network, guided by a non-sampled anchor strategy, to reduce the missed and false detections of small objects. Meanwhile, a deformable convolution network based feature conversion module is proposed to enhance the semantic information of small objects, even potential ones with low confidence. Unlike existing methods that rely on coarse-grained image-based features for image-text matching, Cas-OVD refines these features through a cascade alignment process, allowing each stage to build on the results of the previous one. This can progressively enhance the feature correlation between the image regions and the textual descriptions through successive error correction. On the joint BDD100K-SODA-D dataset, Cas-OVD achieved 17.95% AP$_{\mathrm{all}}$ and 14.6% AP$_{\mathrm{s}}$, outperforming RegionCLIP by 3.5% AP$_{\mathrm{all}}$ and 3.0% AP$_{\mathrm{s}}$, respectively. On the OV_COCO dataset, Cas-OVD achieves 32.71% AP$_{\mathrm{all}}$ and 17.26% AP$_{\mathrm{s}}$, surpassing RegionCLIP by 6.6% AP$_{\mathrm{all}}$ and 6.1% AP$_{\mathrm{s}}$, respectively.
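The AP$_{\mathrm{all}}$ and AP$_{\mathrm{s}}$ numbers quoted here are average-precision scores; for readers unfamiliar with the metric, a minimal sketch of how AP for one class is computed from confidence-ranked detections follows. This is a generic step-integrated precision-recall area, not the exact COCO-style protocol (which additionally sweeps IoU thresholds and interpolates precision).

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one class,
    computed from confidence-ranked detections."""
    order = np.argsort(-np.asarray(scores, dtype=float))     # rank by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # integrate precision over recall increments (step function)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# 3 ground-truth boxes; 4 detections ranked by confidence,
# of which the 1st, 3rd and 4th match a ground-truth box.
ap = average_precision([0.9, 0.8, 0.6, 0.3], [1, 0, 1, 1], num_gt=3)
```

AP$_{\mathrm{s}}$ restricts this computation to small ground-truth objects, which is why it is the figure of merit for a small-object detector like Cas-OVD.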
IEEE Transactions on Multimedia, vol. 28, pp. 757–771, 2025.
Citations: 0
MDT-FI: Mask-Guided Dual-Branch Transformer With Texture and Structure Feature Interaction for Image Inpainting
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632651
Dong Liu;Xiaofeng Wang;Ruidong Han;Jianghua Li;Shanmin Pang
Image inpainting has attracted considerable attention in computer vision and image processing due to its wide range of applications. While deep learning-based methods have shown promising potential, accurately recovering pixel-level details remains a significant challenge, particularly in the presence of large and irregular missing regions. Furthermore, existing methods are limited by unidirectional semantic guidance and a localized understanding of global structural context. In this study, we propose a mask-guided dual-branch Transformer-based framework, named MDT-FI, which effectively balances local detail restoration and global contextual reasoning by explicitly modeling long-range dependencies. MDT-FI consists of three key components: the Interactive Attention Module (IAM), the Spectral Harmonization Module (SHM), and the Lateral Adaptation Network (LAN). The model integrates multi-scale feature interaction, frequency-domain information fusion, and a mask-guided attention mechanism to progressively build cross-level feature associations. This design facilitates multi-level representation learning and optimization, thereby enhancing local texture synthesis while preserving global structural consistency. To further improve perceptual quality, a feature augmenter is employed to assess the fidelity of both texture and structure in the generated results. Extensive experiments on CelebA-HQ, Places2, and Paris Street View demonstrate that MDT-FI significantly outperforms state-of-the-art methods.
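The Spectral Harmonization Module (SHM) mentioned above fuses frequency-domain information; the basic operation behind such modules is splitting an image into low- and high-frequency components in the Fourier domain. A minimal NumPy sketch (a generic hard-mask split, not the paper's SHM, whose exact design is not given in the abstract):

```python
import numpy as np

def frequency_split(img, cutoff):
    """Separate an image into low- and high-frequency components
    using a hard circular mask in the (shifted) Fourier domain."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)   # distance from DC component
    low_mask = dist <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(spec * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * ~low_mask)).real
    return low, high

# A smooth horizontal sine pattern as a toy "image".
img = np.outer(np.sin(np.linspace(0.0, 2.0 * np.pi, 32)), np.ones(32))
low, high = frequency_split(img, cutoff=4)
recon = low + high
```

Because the two masks partition the spectrum, the components sum back to the original image exactly; a learned module can then process structure (low) and texture (high) on separate branches before recombining them.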
IEEE Transactions on Multimedia, vol. 28, pp. 985–997, 2025.
Citations: 0
MUVOD: A Novel Multi-View Video Object Segmentation Dataset and a Benchmark for 3D Segmentation
IF 9.7 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2025-11-14 · DOI: 10.1109/TMM.2025.3632697
Bangning Wei;Joshua Maraval;Meriem Outtas;Kidiyo Kpalma;Nicolas Ramin;Lu Zhang
The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different source datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation masks in 4D motion, meaning that any object of interest in the scene can be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects under different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods.
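Benchmarks of this kind typically score predicted masks against ground truth with region similarity, i.e. the Jaccard index (mask IoU). The abstract does not spell out MUVOD's metric, so the following is a generic sketch of that standard score, not necessarily the paper's exact protocol:

```python
import numpy as np

def mask_iou(pred, gt):
    """Region similarity (Jaccard index) between two binary masks:
    |intersection| / |union|, the standard video object segmentation score."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return float(np.logical_and(pred, gt).sum()) / float(union)

gt = np.zeros((4, 4), dtype=int)
gt[1:3, 1:3] = 1          # 2x2 ground-truth object
pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:4] = 1        # over-segmented 2x3 prediction
iou = mask_iou(pred, gt)
```

For a multi-view dataset the score is averaged over frames, views, and object instances, so an object must be segmented consistently across both time and viewpoints to score well.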
IEEE Transactions on Multimedia, vol. 28, pp. 726–741, 2025.
Citations: 0
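The abstract above mentions an evaluation metric for multi-view video object segmentation without spelling it out. A common starting point for such metrics is per-object mask IoU averaged over (view, frame) pairs; the sketch below implements that generic aggregation. The function names and the averaging scheme are illustrative assumptions, not MUVOD's actual metric definition.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks; empty-vs-empty counts as a perfect match."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)

def mean_iou_over_views(preds, gts) -> float:
    """Average IoU for one tracked object across its (view, frame) mask pairs."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return float(np.mean(scores))
```

A dataset-level score would then average this per-object value over all 459 annotated instances.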
ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-14 DOI: 10.1109/TMM.2025.3632627
Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin
Human-object interaction (HOI) detection aims to learn how humans interact with surrounding objects by inferring fine-grained triples of $\left\langle \mathrm{human, action, object} \right\rangle$, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from long-tailed class distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, which consequently limits their potential in real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts at different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., knowledge related to affordances of object instances and knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features to fully absorb the knowledge prompts. These two encoded features of different fields and the knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.
{"title":"ASK-HOI: Affordance-Scene Knowledge Prompting for Human-Object Interaction Detection","authors":"Dongpan Chen;Dehui Kong;Junna Gao;Jinghua Li;Qianxing Li;Baocai Yin","doi":"10.1109/TMM.2025.3632627","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632627","url":null,"abstract":"Human-object interaction (HOI) detection task aims to learn how humans interact with surrounding objects by inferring fine-grained triples of <inline-formula><tex-math>$\left\langle \mathrm{human, action, object} \right\rangle$</tex-math></inline-formula>, which plays a vital role in computer vision tasks such as human-centered scene understanding and visual question answering. However, HOI detection suffers from class long-tailed distributions and zero-shot problems. Current methods typically identify HOI only from input images or label spaces in a data-driven manner, lacking sufficient knowledge prompts, and consequently limits their potential for real-world scenes. Hence, to fill this gap, this paper introduces affordance and scene knowledge as prompts on different granularities to the HOI detector to improve its recognition ability. Concretely, we first construct a large-scale affordance-scene knowledge graph, named ASKG, whose knowledge can be divided into two categories according to the fields of image information, i.e., the knowledge related to affordances of object instances and the knowledge associated with the scene. Subsequently, the knowledge of affordance and scene specific to the input image is extracted by an ASKG-based prior knowledge embedding module. Since this knowledge corresponds to the image at different granularities, we then propose an instance field adaptive fusion module and a scene field adaptive fusion module to enable visual features fully absorb the knowledge prompts. These two encoded features of different fields and knowledge embeddings are finally fed into a proposed HOI recognition module to predict more accurate HOI results. 
Extensive experiments on both HICO-DET and V-COCO benchmarks demonstrate that the proposed method leads to competitive results compared with the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"742-756"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
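The ASKG-based prior knowledge embedding and the adaptive fusion modules are described above only at a high level. A minimal way to let visual features "absorb" knowledge prompts is cross-attention from visual tokens to knowledge embeddings; the NumPy sketch below shows that generic mechanism. The single-head attention form, weight shapes, and residual fusion are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_fusion(visual, knowledge, W_q, W_k, W_v):
    """Visual tokens attend to knowledge embeddings (single-head cross-attention).

    visual: (n_tok, d) image-side tokens; knowledge: (n_kg, d) graph embeddings.
    Returns visual tokens with knowledge prompts added residually.
    """
    q = visual @ W_q                      # queries from visual tokens
    k = knowledge @ W_k                   # keys from knowledge embeddings
    v = knowledge @ W_v                   # values from knowledge embeddings
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return visual + attn @ v              # residual injection of knowledge
```

In the paper's terms, one such fusion could run at the instance-field granularity and another at the scene-field granularity before HOI recognition.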
Partition Map-Based Fast Block Partitioning for VVC Inter Coding
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-14 DOI: 10.1109/TMM.2025.3632639
Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu
Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.
{"title":"Partition Map-Based Fast Block Partitioning for VVC Inter Coding","authors":"Xinmin Feng;Zhuoyuan Li;Li Li;Dong Liu;Feng Wu","doi":"10.1109/TMM.2025.3632639","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632639","url":null,"abstract":"Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding and improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs, including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion performance loss. 
The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjøntegaard-delta-bit-rate under the random access configuration.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"998-1013"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
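The dual-threshold decision scheme described above can be sketched as a three-way branch on the network's predicted split probability: confident predictions bypass rate-distortion search in one direction or the other, and only uncertain blocks fall back to full RDO. The threshold values and return labels below are illustrative placeholders, not the paper's calibrated settings.

```python
def partition_decision(p_split: float, t_low: float, t_high: float) -> str:
    """Dual-threshold early decision for one block.

    p_split: network-predicted probability that the block should be split.
    Moving t_low/t_high apart trades more complexity reduction for a larger
    risk of rate-distortion loss, and vice versa.
    """
    if p_split >= t_high:
        return "split"            # trust the network: skip the no-split RD check
    if p_split <= t_low:
        return "no_split"         # early-terminate the recursive partition search
    return "full_rd_search"       # uncertain region: exhaustive RD optimization
```

Sweeping the two thresholds yields the fine-grained complexity/BD-rate trade-off curve the abstract refers to.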
Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition
IF 9.7 CAS Tier 1 (Computer Science) Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2025-11-14 DOI: 10.1109/TMM.2025.3632695
Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun
Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.
{"title":"Multi-Granularity Query Network With Adaptive Category Feature Embedding for Behavior Recognition","authors":"Nuoer Long;Yonghao Dang;Kaiwen Yang;Chengpeng Xiong;Shaobin Chen;Tao Tan;Wei Ke;Chan-Tong Lam;Jianqin Yin;Peter H. N. de With;Yue Sun","doi":"10.1109/TMM.2025.3632695","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632695","url":null,"abstract":"Behavior recognition is a highly challenging task, particularly in scenarios requiring unified recognition across both human and animal subjects. Most existing approaches primarily focus on single-species datasets or rely heavily on prior information such as species labels, positional annotations, or skeletal keypoints, which limits their applicability in real-world scenarios where species labels may be ambiguous or annotations are insufficient. To address these limitations, we propose a query-based Multi-Granularity Behavior Recognition Network that directly mines cross-species shared spatiotemporal behavior patterns from raw video inputs. Specifically, we design a Multi-Granularity Query module to effectively fuse fine-grained and coarse-grained features, thereby enhancing the model's capability in capturing spatiotemporal dynamics at different granularities. Additionally, we introduce a Category Query Decoder that leverages learnable category query vectors to achieve explicit behavior category modeling and mapping. 
Without relying on any extra annotations, the proposed method achieves unified recognition of multi-species and multi-category behaviors, setting a new state-of-the-art on the Animal Kingdom dataset and demonstrating strong generalization ability on the Charades dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"878-890"},"PeriodicalIF":9.7,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
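The Category Query Decoder above maps learnable per-category query vectors to behavior logits. A minimal NumPy sketch of that idea follows: each category query attends over per-frame features and scores the pooled result against itself. The attention form and the dot-product scoring are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def category_query_decode(video_feats: np.ndarray, cat_queries: np.ndarray) -> np.ndarray:
    """Explicit category modeling via learnable query vectors.

    video_feats: (T, d) per-frame features; cat_queries: (C, d) learnable
    vectors, one per behavior category. Returns per-category logits (C,).
    """
    scale = np.sqrt(video_feats.shape[-1])
    attn = softmax(cat_queries @ video_feats.T / scale, axis=-1)  # (C, T)
    pooled = attn @ video_feats                # category-specific temporal pooling
    return np.einsum("cd,cd->c", pooled, cat_queries)  # one logit per category
```

Because the queries are category-specific rather than species-specific, the same decoder can score human and animal behaviors without species labels.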
Journal
IEEE Transactions on Multimedia