MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration
Pub Date: 2026-02-16 | DOI: 10.1109/TIP.2026.3663039
Yili Ren;Jinyang Du;Xi Liu;Qianxiao Su;Yue Deng;Hongjue Li
Fine-grained visual referring and grounding are critical for enhancing scene understanding and enabling various real-world vision-language applications. Although recent studies have extended multimodal large language models (MLLMs) to these tasks, they still face significant challenges in fine-grained multi-target scenarios. To address this, we propose MTRAG, a pixel-level multi-target referring and grounding framework that leverages semantic-spatial collaboration. Specifically, we introduce a Channel Extension Mechanism (CEM) that enables a global image encoder to extract global semantics and multi-region representations while retaining background context, without extra region feature extractors. Moreover, we introduce a grounding branch for pixel-level grounding and design a Hybrid Adapter (HA) to fuse semantic features from the MLLM branch with spatial information from the grounding branch, thereby enhancing the semantic-spatial alignment. For training, we meticulously curate MTRAG-D, a dataset comprising single- and multi-target referring and grounding samples derived from existing datasets and newly synthesized free-form multi-target referring instruction-following data. We also present MTR-Bench, a benchmark for systematic evaluation of multi-target referring. Extensive experiments across five core tasks, including single- and multi-target referring and grounding as well as image-level captioning, show that MTRAG consistently outperforms strong baselines on both multi- and single-target tasks, while maintaining competitive image-level understanding. The code is available at https://github.com/deng-ai-lab/MTRAG.
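The abstract does not detail the Hybrid Adapter's internals. As a minimal sketch of semantic-spatial fusion, the PyTorch snippet below lets grounding-branch spatial features attend to MLLM-branch semantic tokens through a single cross-attention layer; the class name, dimensions, and residual design are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a hybrid semantic-spatial adapter: spatial features
# (queries) attend to semantic tokens (keys/values) from a language branch.
import torch
import torch.nn as nn

class HybridAdapter(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, spatial: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # spatial: (B, H*W, C) grounding-branch features; semantic: (B, N, C) tokens
        fused, _ = self.attn(self.norm(spatial), semantic, semantic)
        return spatial + fused  # residual keeps spatial detail intact

feats = HybridAdapter(256)(torch.randn(2, 64 * 64, 256), torch.randn(2, 16, 256))
print(feats.shape)  # torch.Size([2, 4096, 256])
```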
{"title":"MTRAG: Multi-Target Referring and Grounding via Hybrid Semantic-Spatial Integration","authors":"Yili Ren;Jinyang Du;Xi Liu;Qianxiao Su;Yue Deng;Hongjue Li","doi":"10.1109/TIP.2026.3663039","DOIUrl":"10.1109/TIP.2026.3663039","url":null,"abstract":"Fine-grained visual referring and grounding are critical for enhancing scene understanding and enabling various real-world vision-language applications. Although recent studies have extended multimodal large language models (MLLMs) to these tasks, they still face significant challenges in fine-grained multi-target scenarios. To address this, we propose MTRAG, a pixel-level multi-target referring and grounding framework that leverages semantic-spatial collaboration. Specifically, we introduce a Channel Extension Mechanism (CEM) that enables a global image encoder to extract global semantics and multi-region representations while retaining background context, without extra region feature extractors. Moreover, we introduce a grounding branch for pixel-level grounding and design a Hybrid Adapter (HA) to fuse semantic features from the MLLM branch with spatial information from the grounding branch, thereby enhancing the semantic-spatial alignment. For training, we meticulously curate MTRAG-D, a dataset comprising single- and multi-target referring and grounding samples derived from existing datasets and newly synthesized free-form multi-target referring instruction-following data. We also present MTR-Bench, a benchmark for systematic evaluation of multi-target referring. Extensive experiments across five core tasks, including single- and multi-target referring and grounding as well as image-level captioning, show that MTRAG consistently outperforms strong baselines on both multi- and single-target tasks, while maintaining competitive image-level understanding. The code is available at <uri>https://github.com/deng-ai-lab/MTRAG</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2167-2181"},"PeriodicalIF":13.7,"publicationDate":"2026-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146204870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion
Pub Date: 2026-02-16 | DOI: 10.1109/TIP.2026.3662596
Yixin Zhu;Long Lv;Pingping Zhang;Xuehu Liu;Tongdan Tang;Feng Tian;Weibing Sun;Huchuan Lu
Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images that retain texture details and preserve significant information. Recently, some MMIF methods incorporate frequency-domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that ISFM achieves better performance than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.
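A minimal sketch of the band split such spatial-frequency interaction builds on, assuming an FFT-based decomposition with a radial low-pass mask; the ISF module's actual design is not specified in the abstract.

```python
# Illustrative FFT-based low/high-frequency split of a feature map.
import torch

def split_frequency(x: torch.Tensor, radius: float = 0.25):
    """Split (B, C, H, W) features into low- and high-frequency components."""
    spec = torch.fft.fft2(x, norm="ortho")
    spec = torch.fft.fftshift(spec, dim=(-2, -1))
    B, C, H, W = x.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-0.5, 0.5, H), torch.linspace(-0.5, 0.5, W), indexing="ij"
    )
    mask = ((xx**2 + yy**2).sqrt() <= radius).to(x.dtype)  # radial low-pass
    low = torch.fft.ifft2(
        torch.fft.ifftshift(spec * mask, dim=(-2, -1)), norm="ortho"
    ).real
    return low, x - low  # the high band is the residual

low, high = split_frequency(torch.randn(1, 8, 32, 32))
print(low.shape, high.shape)
```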
{"title":"Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion","authors":"Yixin Zhu;Long Lv;Pingping Zhang;Xuehu Liu;Tongdan Tang;Feng Tian;Weibing Sun;Huchuan Lu","doi":"10.1109/TIP.2026.3662596","DOIUrl":"10.1109/TIP.2026.3662596","url":null,"abstract":"Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at <uri>https://github.com/Namn23/ISFM</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2380-2392"},"PeriodicalIF":13.7,"publicationDate":"2026-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146204871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DeepGSR: Deep Group-Based Sparse Representation Network for Solving Image Inverse Problems
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662583
Ke Jiang;Xinya Ji;Baoshun Shi
In the past few years, group-based sparse representation (GSR) has emerged as a powerful paradigm for image inverse problems by synergizing model-driven interpretability with nonlocal self-similarity priors. Nevertheless, its practical utility is hindered by computationally expensive iterative processes. Deep learning (DL) methods avoid this deficiency, but they often lack model interpretability. To bridge this gap, we propose a novel deep group-based sparse representation framework, termed DeepGSR, which brings the GSR method and the DL approach together. DeepGSR not only circumvents the iterative bottlenecks of conventional GSR but also preserves its model interpretability through a learnable parameterization. Specifically, the network is built upon a GSR model that leverages nonlocal self-similarity, and it integrates adaptive patch matching and aggregation mechanisms to model complex intra-group relationships in the latent space. To reduce the computational complexity associated with traditional SVD-based rank shrinkage, we introduce a learnable low-rank shrinkage module that incorporates low-rank constraints while enhancing the interpretability and adaptability of the model. To better exploit frequency-specific structures, the network incorporates a shifting wavelet-domain patch partitioning strategy, which separately models high- and low-frequency components to further enhance the representation ability of the network. Extensive experiments demonstrate that DeepGSR, when applied as a drop-in replacement module to various image inverse problems such as image denoising, single-image deraining, metal artifact reduction, sparse-view computed tomography reconstruction, phase retrieval, and all-in-one image restoration, consistently delivers effective performance, validating the effectiveness of the proposed framework. The source code and datasets have been made publicly available at https://github.com/shibaoshun/DeepGSR.
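For context, the snippet below shows the classical SVD-based low-rank shrinkage of a patch group that the learnable module is designed to replace; exposing the threshold as a learnable parameter, as here, is one simple way to make the step trainable (a sketch, not the paper's module).

```python
# Singular-value soft-thresholding on a group of similar patches.
import torch

def lowrank_shrink(group: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """group: (N, D) matrix of N vectorized similar patches; tau: threshold."""
    U, S, Vh = torch.linalg.svd(group, full_matrices=False)
    S_shrunk = torch.clamp(S - tau, min=0.0)  # soft-threshold singular values
    return U @ torch.diag(S_shrunk) @ Vh      # low-rank reconstruction

tau = torch.nn.Parameter(torch.tensor(0.1))  # learned jointly with the network
out = lowrank_shrink(torch.randn(16, 64), tau)
print(out.shape)  # torch.Size([16, 64])
```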
{"title":"DeepGSR: Deep Group-Based Sparse Representation Network for Solving Image Inverse Problems","authors":"Ke Jiang;Xinya Ji;Baoshun Shi","doi":"10.1109/TIP.2026.3662583","DOIUrl":"10.1109/TIP.2026.3662583","url":null,"abstract":"In the past few years, group-based sparse representation (GSR) has emerged as a powerful paradigm for image inverse problems by synergizing model-driven interpretability with nonlocal self-similarity priors. Nevertheless, its practical utility is hindered by computationally expensive iterative processes. Deep learning (DL) methods can avoid this deficiency, but they often lack of model interpretability. To bridge this gap, we propose a novel deep group-based sparse representation framework, termed DeepGSR, which brings the GSR method and the DL approach together. DeepGSR not only circumvents the iterative bottlenecks of conventional GSR but also preserves its model interpretability through a learnable parameterization. Specifically, the network is built upon a GSR model that leverages nonlocal self-similarity, and it integrates adaptive patch matching and aggregation mechanisms to model complex intra-group relationships in the latent space. To reduce the computational complexity associated with traditional SVD-based rank shrinkage, we introduce a learnable low-rank shrinkage module that incorporates low-rank constraints while enhancing the interpretability and adaptability of the model. To better exploit frequency-specific structures, the network incorporates a shifting wavelet-domain patch partitioning strategy, which separately models high- and low-frequency components to further enhance the representation ability of the network. Extensive experiments demonstrate that DeepGSR, when applied as a drop-in replacement module to various image inverse problems such as image denoising, single-image deraining, metal artifact reduction, sparse-view computed tomography reconstruction, phase retrieval, and all-in-one image restoration consistently delivers effective performance, validating the effectiveness of the proposed framework. The source code and datasets have been made publicly available at <uri>https://github.com/shibaoshun/DeepGSR</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2454-2469"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASDTracker: Adaptively Sparse Detection With Attention-Guided Refinement for Efficient Multi-Object Tracking
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662594
Yueying Wang;Chenyang Yan;Cairong Zhao;Weidong Zhang;Dan Zeng
Tracking-by-Detection paradigms shine in generic multi-object tracking (MOT), while their computationally heavy components hinder real-time applications. In this work, we attribute the substantial computational burden to two expensive components, i.e., detection and re-identification. Building upon the principle of adaptively maintaining acceptable inference efficiency, we present Adaptively Sparse Detection with attention-guided refinement (ASDTracker) for efficient tracking. Specifically, our ASDTracker rapidly assesses short-term and long-term occlusion, dynamically determining the usage of the expensive detector. For non-key frames, we efficiently refine small-size crops around Kalman Filter predictions and introduce noisy shadow labels to robustly train this refinement network. Additionally, we substitute a lightweight appearance representation for the heavy ReID network, which efficiently extracts sufficient appearance cues in coarsely quantized color spaces. Extensive experiments on four benchmarks demonstrate that ASDTracker achieves competitive performance in generalization and robustness at a favorable inference speed. Moreover, the tracker is further deployed on an unmanned surface vehicle with high accuracy and low latency in real-world scenarios.
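A rough sketch of a lightweight appearance cue from a coarsely quantized color space, in the spirit of the ReID substitute described above; the bin count and the histogram-intersection similarity are assumptions.

```python
# Joint RGB histogram over coarsely quantized colors as a cheap appearance cue.
import torch

def color_histogram(crop: torch.Tensor, bins: int = 8) -> torch.Tensor:
    """crop: (3, H, W) in [0, 1] -> L1-normalized joint color histogram."""
    q = (crop.clamp(0, 1) * (bins - 1)).long()       # quantize each channel
    idx = (q[0] * bins + q[1]) * bins + q[2]         # joint bin index per pixel
    hist = torch.bincount(idx.flatten(), minlength=bins**3).float()
    return hist / hist.sum()

a, b = torch.rand(3, 64, 32), torch.rand(3, 64, 32)
sim = torch.minimum(color_histogram(a), color_histogram(b)).sum()  # intersection
print(float(sim))
```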
{"title":"ASDTracker: Adaptively Sparse Detection With Attention-Guided Refinement for Efficient Multi-Object Tracking","authors":"Yueying Wang;Chenyang Yan;Cairong Zhao;Weidong Zhang;Dan Zeng","doi":"10.1109/TIP.2026.3662594","DOIUrl":"10.1109/TIP.2026.3662594","url":null,"abstract":"Tracking-by-Detection paradigms shine in generic multi-object tracking (MOT), while their compact construction hinders the real-time applications. In this work, we attribute the substantial computational burden to two expensive components, i.e. detection and re-identification. Building upon the principle of adaptively maintaining acceptable inference efficiency, we present Adaptively Sparse Detection with attention-guided refinement (ASDTracker) for efficient tracking. In specific, our ASDTracker rapidly assess the short-term and long-term occlusion, dynamically determining the usage of the expensive detector. For non-key frames, we efficiently refine small-size crops out of Kalman Filter predictions and introduce the noisy shadow labels to robustly train this refinement network. Additionally, we substitute the lightweight appearance representation for the heavy ReID network, which efficiently extracts sufficient appearance cues in the coarsely quantized color spaces. Extensive experiments on four benchmarks demonstrate that ASDTracker achieves competitive performance in generalization and robustness under favorable inference speed. Moreover, the efficient tracking deployment is further implemented to an unmanned surface vehicle with high accuracy and low latency in real-world scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1993-2006"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust 2.5D Feature Matching in Light Fields via a Learnable Parameterized Depth-Degraded Projection
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662579
Meng Zhang;Haiyan Jin;Zhaolin Xiao;Jinglei Shi;Xiaoran Jiang
Due to the loss of 3D information, accurate and robust 2D image feature matching remains challenging for many computer vision applications. This paper introduces a 2.5D feature that uses the disparity value from the light field Fourier disparity layer (FDL) as a rough proxy of scene depth. Without explicit depth estimation, a parameterized depth-degraded projection is proposed to construct the geometric transformation of paired features between two light fields. Then, we propose a parameterized learning solution to calculate the depth-degraded projection. This solution estimates a global constant fundamental matrix, a variable disparity-guided translation vector, and a depth compensation term using a very simple network. Although the 0.5D relative disparity provided by the FDL does not represent precise depth, it still significantly reduces depth ambiguity in feature matching. Therefore, the proposed solution achieves accurate feature-matching results by minimizing the sum of reprojection errors across all matching candidates. On the public light field feature-matching dataset, the proposed solution outperforms existing 2D image feature-matching solutions and light field feature-matching algorithms in terms of matching accuracy and robustness. The code is available online.
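The snippet below sketches only the matching objective: the sum of point-to-epipolar-line distances under a fundamental matrix F. The disparity-guided translation and depth compensation terms of the full projection are omitted.

```python
# Sum of point-to-epipolar-line distances for matched points under F.
import torch

def epipolar_residual(F: torch.Tensor, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """F: (3, 3) fundamental matrix; x1, x2: (N, 2) matched pixel coordinates."""
    ones = torch.ones(x1.shape[0], 1)
    p1 = torch.cat([x1, ones], dim=1)        # homogeneous coordinates
    p2 = torch.cat([x2, ones], dim=1)
    l2 = p1 @ F.T                            # epipolar lines in the second image
    num = (p2 * l2).sum(dim=1).abs()         # |x2^T F x1|
    den = l2[:, :2].norm(dim=1).clamp_min(1e-8)
    return (num / den).sum()                 # sum of reprojection errors

print(epipolar_residual(torch.eye(3), torch.rand(10, 2), torch.rand(10, 2)))
```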
{"title":"Robust 2.5D Feature Matching in Light Fields via a Learnable Parameterized Depth-Degraded Projection","authors":"Meng Zhang;Haiyan Jin;Zhaolin Xiao;Jinglei Shi;Xiaoran Jiang","doi":"10.1109/TIP.2026.3662579","DOIUrl":"10.1109/TIP.2026.3662579","url":null,"abstract":"Due to the loss of 3D information, accurate and robust 2D image feature matching remains challenging for many computer vision applications. This paper introduces a 2.5D feature that uses the disparity value from the light field Fourier disparity layer (FDL) as a rough proxy of scene depth. Without explicit depth estimation, a parameterized depth-degraded projection is proposed to construct the geometric transformation of paired features between two light fields. Then, we propose a parameterized learning solution to calculate the depth-degraded projection. This solution estimates a global constant fundamental matrix, a variable disparity-guided translation vector, and a depth compensation term using a very simple network. Although the 0.5D relative disparity provided by the FDL does not represent precise depth, it can also significantly reduce the depth ambiguity in feature matching. Therefore, the proposed solution achieves accurate feature-matching results by minimizing the sum of reprojection errors across all matching candidates. On the public light field feature-matching dataset, the proposed solution outperforms existing 2D image feature-matching solutions and light field feature-matching algorithms in terms of matching accuracy and robustness. The code is available online.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2235-2248"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C-WOE: Clustering for Out-of-Distribution Detection Learning With Wild Outlier Exposure
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662593
Long Lan;Zhaohui Hu;He Li;Tongliang Liu;Xinwang Liu
Out-of-distribution (OOD) detection plays a crucial role as a mechanism for handling anomalies in computer vision systems. Among existing approaches, outlier exposure (OE), which trains the model with an additional auxiliary OOD dataset, has demonstrated strong effectiveness. However, acquiring clean and well-curated auxiliary OOD data is often infeasible, particularly within large and complex systems. Alternatively, wild outliers, i.e., unlabeled samples collected directly in deployment environments, are abundant and easy to obtain, and recent studies have shown that they can substantially benefit OOD detection learning. Nevertheless, wild outliers typically contain a mixture of in-distribution (ID) and OOD samples. Directly using them as auxiliary OOD data unavoidably exposes the model to adverse supervision signals arising from the contained ID samples. Yet existing methods still lack an effective strategy that can fully leverage wild outliers while suppressing the negative influence introduced by their ID subset. To this end, we propose a simple yet effective method named Clustering for Wild Outlier Exposure (C-WOE), which alleviates the adverse effect of the ID samples contained within wild outliers by reweighting them. Specifically, C-WOE assigns higher weights to real OOD samples and lower weights to ID samples and dynamically updates these weights during training. Theoretically, we establish solid guarantees for the proposed method. Empirically, extensive experiments conducted on various real-world benchmarks and simulated datasets demonstrate that C-WOE notably achieves superior performance compared with state-of-the-art methods, validating its reliability in image processing applications.
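As an illustration of the reweighting idea, assuming feature-space clustering: wild samples far from every in-distribution centroid receive larger OOD weights. The distance-to-weight mapping below is a hypothetical stand-in for the paper's rule.

```python
# Weight wild outliers by their distance to in-distribution feature centroids.
import torch

def wild_outlier_weights(wild_feats: torch.Tensor, id_centroids: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """wild_feats: (N, D); id_centroids: (K, D) -> per-sample weights in (0, 1)."""
    d_min = torch.cdist(wild_feats, id_centroids).min(dim=1).values
    # Far-from-ID samples get weights near 1, likely-ID samples near 0.
    return torch.sigmoid((d_min - d_min.mean()) / temperature)

w = wild_outlier_weights(torch.randn(6, 16), torch.randn(3, 16))
print(w)  # higher weight -> treated more confidently as real OOD
```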
{"title":"C-WOE: Clustering for Out-of-Distribution Detection Learning With Wild Outlier Exposure","authors":"Long Lan;Zhaohui Hu;He Li;Tongliang Liu;Xinwang Liu","doi":"10.1109/TIP.2026.3662593","DOIUrl":"10.1109/TIP.2026.3662593","url":null,"abstract":"Out-of-distribution (OOD) detection plays a crucial role as a mechanism for handling anomalies in computer vision systems. Among existing approaches, outlier exposure (OE), which trains the model with an additional auxiliary OOD dataset, has demonstrated strong effectiveness. However, acquiring clean and well-curated auxiliary OOD data is often infeasible, particularly within large and complex systems. Alternatively, wild outliers, i.e., unlabeled samples collected directly in deployment environments, are abundant and easy to obtain, and recent studies have shown that they can substantially benefit OOD detection learning. Nevertheless, wild outliers typically contain a mixture of in-distribution (ID) and OOD samples. Directly using them as auxiliary OOD data unavoidably exposes the model to adverse supervision signals arising from the contained ID samples. Yet existing methods still lack an effective strategy that can fully leverage wild outliers while suppressing the negative influence introduced by their ID subset. To this end, we propose a simple yet effective method named Clustering for Wild Outlier Exposure (C-WOE), which alleviates the adverse effect of the ID samples contained within wild outliers by reweighting them. Specifically, C-WOE assigns higher weights to real OOD samples and lower weights to ID samples and dynamically updates these weights during training. Theoretically, we establish solid guarantees for the proposed method. Empirically, extensive experiments conducted on various real-world benchmarks and simulated datasets demonstrate that C-WOE notably achieves superior performance compared with state-of-the-art methods, validating its reliability in image processing applications.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2066-2079"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IMPRESS: Incomplete Human Motion Prediction via Motion Recovery and Structural-Semantic Fusion
Pub Date: 2026-02-13 | DOI: 10.1109/TIP.2026.3662597
Hao Deng;Jinkai Li;Jinxing Li;Jie Wen;Yong Xu
Human motion prediction is a key task in computer vision and human-robot interaction, which has received much attention in recent years. However, existing approaches suffer from two issues: 1) they typically rely only on complete data and overlook real-world challenges such as missing observations; 2) recent works fail to capture the diverse relations among body parts in different action categories, which limits their prediction performance. To address the above problems, we propose a novel Incomplete human Motion Prediction method through motion Recovery and Structural-Semantic fusion (IMPRESS). Specifically, for motion recovery, we introduce a wavelet-based self-attention module. It captures motion details from high-frequency features and extracts global trends from low-frequency components. To enhance the relations among different body parts, we design a structural-semantic fusion graph convolutional network. Moreover, we employ a dual-channel sliding window attention mechanism to capture motion periodicity, enabling smoother predictions. Extensive experiments on two benchmark datasets (Human3.6M, CMU-MoCap) demonstrate that IMPRESS achieves state-of-the-art average prediction performance under both complete and incomplete observations.
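A minimal sketch of the low/high-frequency split that such a wavelet-based module operates on, using a single-level Haar transform along the time axis (even sequence length assumed; the attention itself is omitted).

```python
# Single-level Haar wavelet split of a motion sequence along time.
import torch

def haar_split(motion: torch.Tensor):
    """motion: (B, T, J) with even T -> (low, high), each (B, T//2, J)."""
    even, odd = motion[:, 0::2], motion[:, 1::2]
    low = (even + odd) / 2 ** 0.5    # approximation band: global trend
    high = (even - odd) / 2 ** 0.5   # detail band: fine motion detail
    return low, high

low, high = haar_split(torch.randn(2, 50, 66))
print(low.shape, high.shape)  # torch.Size([2, 25, 66]) twice
```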
{"title":"IMPRESS: Incomplete Human Motion Prediction via Motion Recovery and Structural-Semantic Fusion","authors":"Hao Deng;Jinkai Li;Jinxing Li;Jie Wen;Yong Xu","doi":"10.1109/TIP.2026.3662597","DOIUrl":"10.1109/TIP.2026.3662597","url":null,"abstract":"Human motion prediction is a key task in computer vision and human-robot interaction, which has received much attention in recent years. However, existing approaches suffer from two issues: 1) They typically rely only on complete data and overlook real-world challenges such as missing observations. 2) Recent works fail to capture the diverse relations among body parts in different action categories, which limits their prediction performance. To address the above problems, we propose a novel Incomplete human Motion Prediction method through motion R<sc>e</small> covery and Structure-Semantic fusion (IMPRESS). Specifically, for motion recovery, we introduce a wavelet-based self-attention module. It captures motion details from high-frequency features and extracts global trends from low-frequency components. To enhance the relations among different body parts, we design a structure-semantic fusion graph convolutional network. Moreover, we employ a dual-channel sliding window attention mechanism to capture motion periodicity, enabling smoother predictions. Extensive experiments on two benchmark datasets (Human3.6M, CMU-MoCap) demonstrate that IMPRESS achieves state-of-the-art average prediction performance under both complete and incomplete observations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2393-2406"},"PeriodicalIF":13.7,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146195737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid Granularity Distribution Estimation for Few-Shot Learning: Statistics Transfer From Categories and Instances
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661814
Shuo Wang;Tianyu Qi;Xingyu Zhu;Yanbin Hao;Beier Zhu;Hanwang Zhang;Meng Wang
Distribution estimation is a pivotal strategy in few-shot learning (FSL) to mitigate data scarcity by sampling from estimated distributions, utilizing statistical properties (mean and variance) transferred from related base categories. However, category-level estimation alone often fails to generate representative samples due to significant dissimilarities between base and novel categories, leading to suboptimal performance. To address this limitation, we propose Hybrid Granularity Distribution Estimation (HGDE), which integrates both coarse-grained category-level statistics and fine-grained instance-level statistics. By leveraging instance statistics from the nearest base samples, HGDE enhances the characterization of novel categories, capturing subtle features that category-level estimation overlooks. These statistics are fused through linear interpolation to form a robust distribution for novel categories, ensuring both diversity and representativeness in generated samples. Additionally, HGDE employs refined estimation techniques, such as weighted summation for mean calculation and principal component retention for covariance, to further improve accuracy. Empirical evaluations on four FSL benchmarks, including Mini-ImageNet, Tiered-ImageNet, CUB and CIFAR-FS, demonstrate that HGDE offers effective distribution estimation capabilities and leads to notable accuracy gains, with improvements of more than 1.8% in 1-shot tasks on CUB. These results highlight HGDE’s ability to balance mean precision and variance diversity, making it a versatile and effective solution for FSL.
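A compact sketch of the hybrid estimate: category-level and instance-level statistics are linearly interpolated, then extra features are sampled from the result. The interpolation weight and the diagonal-covariance simplification are assumptions for illustration.

```python
# Interpolate category- and instance-level statistics, then sample features.
import torch

def hybrid_sample(mu_cat, var_cat, mu_inst, var_inst, alpha=0.5, n=64):
    """All statistics are (D,) tensors. Returns (n, D) sampled features."""
    mu = alpha * mu_cat + (1 - alpha) * mu_inst      # fused mean
    var = alpha * var_cat + (1 - alpha) * var_inst   # fused (diagonal) variance
    return mu + var.sqrt() * torch.randn(n, mu.shape[0])

D = 128
feats = hybrid_sample(torch.zeros(D), torch.ones(D),
                      torch.full((D,), 0.5), torch.ones(D) * 0.8)
print(feats.shape)  # torch.Size([64, 128])
```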
{"title":"Hybrid Granularity Distribution Estimation for Few-Shot Learning: Statistics Transfer From Categories and Instances","authors":"Shuo Wang;Tianyu Qi;Xingyu Zhu;Yanbin Hao;Beier Zhu;Hanwang Zhang;Meng Wang","doi":"10.1109/TIP.2026.3661814","DOIUrl":"10.1109/TIP.2026.3661814","url":null,"abstract":"Distribution estimation is a pivotal strategy in few-shot learning (FSL) to mitigate data scarcity by sampling from estimated distributions, utilizing statistical properties (mean and variance) transferred from related base categories. However, category-level estimation alone often fails to generate representative samples due to significant dissimilarities between base and novel categories, leading to suboptimal performance. To address this limitation, we propose Hybrid Granularity Distribution Estimation (HGDE), which integrates both coarse-grained category-level statistics and fine-grained instance-level statistics. By leveraging instance statistics from the nearest base samples, HGDE enhances the characterization of novel categories, capturing subtle features that category-level estimation overlooks. These statistics are fused through linear interpolation to form a robust distribution for novel categories, ensuring both diversity and representativeness in generated samples. Additionally, HGDE employs refined estimation techniques, such as weighted summation for mean calculation and principal component retention for covariance, to further improve accuracy. Empirical evaluations on four FSL benchmarks, including Mini-ImageNet, Tiered-ImageNet, CUB and CIFAR-FS, demonstrate that HGDE offers effective distribution estimation capabilities and leads to notable accuracy gains, with improvements of more than 1.8% in 1-shot tasks on CUB. These results highlight HGDE’s ability to balance mean precision and variance diversity, making it a versatile and effective solution for FSL.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"2080-2093"},"PeriodicalIF":13.7,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Fine-Grained Fusion Network for Multimodal UAV Object Detection
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661868
Zhanyan Tang;Zhihao Wu;Mu Li;Jie Wen;Bob Zhang;Yong Xu;Jianqiang Li
Multimodal perception and fusion play a vital role in uncrewed aerial vehicle (UAV) object detection. Existing methods typically adopt global fusion strategies across modalities. However, due to illumination variation, the effectiveness of RGB and infrared modalities may differ across local regions within the same image, particularly in UAV perspectives where occlusions and dense small objects are prevalent, leading to suboptimal performance of global fusion methods. To address this issue, we propose an adaptive fine-grained fusion network for multimodal UAV object detection. First, we design a local feature consistency-based modality fusion module, which adaptively assigns local fusion weights according to the structural consistency of high-response regions across modalities, thereby enabling more effective aggregation of object-relevant features. Second, we introduce a mutual information-guided feature contrastive loss to encourage the preservation of modality-specific information during the early training phase. Experimental results demonstrate that the proposed method effectively addresses the issue of object occlusion in UAV perspectives, achieving state-of-the-art performance on multimodal UAV object detection benchmarks. Code will be available at https://github.com/lingf5877/AFFNet.
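A toy sketch of locally adaptive fusion, assuming window-level weights derived from the cosine consistency of RGB and infrared activations; the window size and the weighting rule are illustrative, not the paper's module.

```python
# Per-window fusion weights from cross-modal feature consistency.
import torch
import torch.nn.functional as F

def local_fusion(rgb: torch.Tensor, ir: torch.Tensor, win: int = 8) -> torch.Tensor:
    """rgb, ir: (B, C, H, W) with H, W divisible by win."""
    pooled_r = F.avg_pool2d(rgb, win)            # (B, C, H/win, W/win)
    pooled_i = F.avg_pool2d(ir, win)
    consistency = F.cosine_similarity(pooled_r, pooled_i, dim=1, eps=1e-8)
    w = torch.sigmoid(consistency).unsqueeze(1)  # (B, 1, H/win, W/win)
    w = F.interpolate(w, scale_factor=win, mode="nearest")  # back to (H, W)
    return w * rgb + (1 - w) * ir                # region-wise convex fusion

out = local_fusion(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))
print(out.shape)  # torch.Size([1, 16, 64, 64])
```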
{"title":"Adaptive Fine-Grained Fusion Network for Multimodal UAV Object Detection","authors":"Zhanyan Tang;Zhihao Wu;Mu Li;Jie Wen;Bob Zhang;Yong Xu;Jianqiang Li","doi":"10.1109/TIP.2026.3661868","DOIUrl":"10.1109/TIP.2026.3661868","url":null,"abstract":"Multimodal perception and fusion play a vital role in uncrewed aerial vehicle (UAV) object detection. Existing methods typically adopt global fusion strategies across modalities. However, due to illumination variation, the effectiveness of RGB and infrared modalities may differ across local regions within the same image, particularly in UAV perspectives where occlusions and dense small objects are prevalent, leading to suboptimal performance of global fusion methods. To address this issue, we propose an adaptive fine-grained fusion network for multimodal UAV object detection. First, we design a local feature consistency-based modality fusion module, which adaptively assigns local fusion weights according to the structural consistency of high-response regions across modalities, thereby enabling more effective aggregation of object-relevant features. Second, we introduce a mutual information-guided feature contrastive loss to encourage the preservation of modality-specific information during the early training phase. Experimental results demonstrate that the proposed method effectively addresses the issue of object occlusion in UAV perspectives, achieving state-of-the-art performance on multimodal UAV object detection benchmarks. Code will be available at <uri>https://github.com/lingf5877/AFFNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1870-1882"},"PeriodicalIF":13.7,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CGMNet: A Center-Pixel and Gated Mechanism-Based Attention Network for Hyperspectral Change Detection
Pub Date: 2026-02-11 | DOI: 10.1109/TIP.2026.3661851
Lanxin Wu;Jiangtao Peng;Bing Yang;Weiwei Sun;Mingzhu Huang
Change detection (CD) in hyperspectral images (HSIs) has become an increasingly vital research field in remote sensing. Over the past few years, the adoption of deep learning approaches, particularly convolutional neural network (CNN)- and transformer-based architectures, has significantly advanced performance in this field. While these models effectively capture spectral-spatial features, they may also introduce redundant or irrelevant spatial information, potentially degrading the accuracy of HSI CD. To address this challenge, a center-pixel and gated mechanism-based attention network (CGMNet) is proposed for HSI CD, leveraging the central pixel's significance to enhance accuracy and robustness. First, a gating-based center spatial attention (GCSA) module is designed to emphasize spatial relationships surrounding the central pixel. By incorporating gating mechanisms, GCSA selectively enhances relevant spatial features while suppressing irrelevant information. Second, a gating-based spectral attention (GSA) module is proposed to dynamically highlight the most significant spectral features, ensuring an effective spectral representation. Finally, a global transform fusion (GTF) module is proposed to capture global contextual information and to fuse it with the extracted spatial and spectral features. Moreover, we introduce a novel benchmark dataset, named the Hangzhou Bay (HZB) dataset, specifically designed to advance coastal remote sensing research. Experimental evaluations conducted on three publicly available datasets, as well as the HZB dataset, show that our CGMNet consistently outperforms several state-of-the-art methods in the HSI CD task. The source code of the proposed CGMNet, along with the HZB dataset, will be made publicly available at https://github.com/creativeXin/CGMNet.
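A simplified sketch of center-pixel-guided spatial gating: each position in a patch is weighted by its similarity to the central pixel. This stands in for, but does not reproduce, the GCSA module.

```python
# Gate spatial positions by their similarity to the patch's central pixel.
import torch

def center_gated(patch: torch.Tensor) -> torch.Tensor:
    """patch: (B, C, H, W) with odd H, W -> gated features, same shape."""
    B, C, H, W = patch.shape
    center = patch[:, :, H // 2, W // 2].view(B, C, 1, 1)
    sim = torch.nn.functional.cosine_similarity(patch, center, dim=1)  # (B, H, W)
    gate = torch.sigmoid(sim).unsqueeze(1)  # suppress pixels unlike the center
    return patch * gate

out = center_gated(torch.randn(2, 32, 9, 9))
print(out.shape)  # torch.Size([2, 32, 9, 9])
```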
{"title":"CGMNet: A Center-Pixel and Gated Mechanism-Based Attention Network for Hyperspectral Change Detection","authors":"Lanxin Wu;Jiangtao Peng;Bing Yang;Weiwei Sun;Mingzhu Huang","doi":"10.1109/TIP.2026.3661851","DOIUrl":"10.1109/TIP.2026.3661851","url":null,"abstract":"Change detection (CD) in hyperspectral images (HSIs) has become an increasingly vital research field in remote sensing. Over the past few years, the adoption of deep learning approaches, particularly convolutional neural network (CNN) and transformer-based architectures have significantly advanced performance in this field. While these models effectively capture spectral-spatial features, they may also introduce redundant or irrelevant spatial information, potentially degrading the accuracy of HSI CD. To address this challenge, a center-pixel and gated mechanism-based attention network (CGMNet) is proposed for HSI CD, leveraging the central pixel’s significance to enhance accuracy and robustness. First, a gated-based center spatial attention (GCSA) module is designed to emphasize spatial relationships surrounding the central pixel. By incorporating gating mechanisms, GCSA selectively enhances relevant spatial features while suppressing irrelevant information. Second, a gated-based spectral attention (GSA) module is proposed to dynamically highlight the most significant spectral features, ensuring an effective spectral representation. Finally, a global transform fusion (GTF) module is proposed to capture global contextual information and to fuse it with the extracted spatial and spectral features. Moreover, we introduce a novel benchmark dataset, named the Hangzhou Bay (HZB), specifically designed to advance coastal remote sensing research. Experimental evaluations conducted on three publicly available datasets, as well as the HZB dataset, show that our CGMNet consistently outperforms some state-of-the-art methods in the HSI CD task. The source code of the proposed CGMNet, along with the HZB dataset, will be made publicly available at <uri>https://github.com/creativeXin/CGMNet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"35 ","pages":"1951-1965"},"PeriodicalIF":13.7,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146161245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}