Computer Vision and Image Understanding: Latest Articles, Page 8

PMGNet: Disentanglement and entanglement benefit mutually for compositional zero-shot learning

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-16 DOI: 10.1016/j.cviu.2024.104197

Yu Liu, Jianghao Li, Yanyi Zhang, Qi Jia, Weimin Wang, Nan Pu, Nicu Sebe

Compositional zero-shot learning (CZSL) aims to model compositions of two primitives (i.e., attributes and objects) to classify unseen attribute-object pairs. Most studies are devoted to integrating disentanglement and entanglement strategies to circumvent the trade-off between contextuality and generalizability. Indeed, the two strategies can mutually benefit when used together. Nevertheless, they neglect the significance of developing mutual guidance between the two strategies. In this work, we take full advantage of guidance from disentanglement to entanglement and vice versa. Additionally, we propose exploring multi-scale feature learning to achieve fine-grained mutual guidance in a progressive framework. Our approach, termed Progressive Mutual Guidance Network (PMGNet), unifies disentanglement–entanglement representation learning, allowing them to learn from and teach each other progressively in one unified model. Furthermore, to alleviate overfitting recognition on seen pairs, we adopt a relaxed cross-entropy loss to train PMGNet, without an increase of time and memory cost. Extensive experiments on three benchmarks demonstrate that our method achieves distinct improvements, reaching state-of-the-art performance. Moreover, PMGNet exhibits promising performance under the most challenging open-world CZSL setting, especially for unseen pairs.

Citations: 0

FTM: The Face Truth Machine—Hand-crafted features from micro-expressions to support lie detection

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-16 DOI: 10.1016/j.cviu.2024.104188

Maria De Marsico, Giordano Dionisi, Donato Francesco Pio Stanco

This work deals with the delicate task of lie detection from facial dynamics. The proposed Face Truth Machine (FTM) is an intelligent system able to support a human operator without any special equipment. It can be embedded in the present infrastructures for forensic investigation or whenever it is required to assess the trustworthiness of responses during an interview. Due to its flexibility and its non-invasiveness, it can overcome some limitations of present solutions. Of course, privacy issues may arise from the use of such systems, as often underlined nowadays. However, it is up to the utilizer to take these into account and make fair use of tools of this kind. The paper will discuss particular aspects of the dynamic analysis of face landmarks to detect lies. In particular, it will delve into the behavior of the features used for detection and how these influence the system’s final decision. The novel detection system underlying the Face Truth Machine is able to analyze the subject’s expressions in a wide range of poses. The results of the experiments presented testify to the potential of the proposed approach and also highlight the very good results obtained in cross-dataset testing, which usually represents a challenge for other approaches.

Citations: 0

M3A: A multimodal misinformation dataset for media authenticity analysis

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-15 DOI: 10.1016/j.cviu.2024.104205

Qingzheng Xu, Huiqiang Chen, Heming Du, Hu Zhang, Szymon Łukasik, Tianqing Zhu, Xin Yu

With the development of various generative models, misinformation in news media becomes more deceptive and easier to create, posing a significant problem. However, existing datasets for misinformation study often have limited modalities, constrained sources, and a narrow range of topics. These limitations make it difficult to train models that can effectively combat real-world misinformation. To address this, we propose a comprehensive, large-scale Multimodal Misinformation dataset for Media Authenticity Analysis ($M^{3}A$), featuring broad sources and fine-grained annotations for topics and sentiments. To curate $M^{3}A$, we collect genuine news content from 60 renowned news outlets worldwide and generate fake samples using multiple techniques. These include altering named entities in texts, swapping modalities between samples, creating new modalities, and misrepresenting movie content as news. $M^{3}A$ contains 708K genuine news samples and over 6M fake news samples, spanning text, images, audio, and video. $M^{3}A$ provides detailed multi-class labels, crucial for various misinformation detection tasks, including out-of-context detection and deepfake detection. For each task, we offer extensive benchmarks using state-of-the-art models, aiming to enhance the development of robust misinformation detection systems.

Citations: 0

Region-aware image-based human action retrieval with transformers

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-14 DOI: 10.1016/j.cviu.2024.104202

Hongsong Wang, Jianhua Zhao, Jie Gui

Human action understanding is a fundamental and challenging task in computer vision. Although there is extensive research in this area, most works focus on action recognition, while action retrieval has received less attention. In this paper, we focus on the neglected but important task of image-based action retrieval, which aims to find images that depict the same action as a query image. We establish benchmarks for this task and set up important baseline methods for fair comparison. We present a Transformer-based model that learns rich action representations from three aspects: the anchored person, contextual regions, and the global image. A fusion transformer is designed to model the relationships among different features and effectively fuse them into an action representation. Experiments on both the Stanford-40 and PASCAL VOC 2012 Action datasets show that the proposed method significantly outperforms previous approaches for image-based action retrieval.
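
As a rough illustration of the retrieval task described in this abstract (not the authors' Transformer model), image-based action retrieval can be viewed as ranking gallery images by how similar their embeddings are to the query embedding. The sketch below assumes the embeddings already exist as plain lists of floats and that none of them is the zero vector; the names `cosine` and `retrieve` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query."""
    ranked = sorted(range(len(gallery)),
                    key=lambda j: cosine(query, gallery[j]),
                    reverse=True)
    return ranked[:k]


# Toy 2-D embeddings (made-up values, for illustration only).
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
top = retrieve(query, gallery, k=2)
```

In this toy gallery the second embedding points almost in the same direction as the query, so it is ranked first. A learned model would change how the embeddings are produced, not how the ranking is done.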

Citations: 0

A simple but effective vision transformer framework for visible–infrared person re-identification

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-11 DOI: 10.1016/j.cviu.2024.104192

Yudong Li, Sanyuan Zhao, Jianbing Shen

In the context of visible–infrared person re-identification (VI-ReID), the acquisition of a robust visual representation is paramount. Existing approaches predominantly rely on convolutional neural networks (CNNs), which are guided by intricately designed loss functions to extract features. In contrast, the vision transformer (ViT), a potent visual backbone, has often yielded subpar results in VI-ReID. We contend that the prevailing training methodologies and insights derived from CNNs do not seamlessly apply to ViT, leading to the underutilization of its potential in VI-ReID. One notable limitation is ViT’s appetite for extensive data, exemplified by the JFT-300M dataset, to surpass CNNs. Consequently, ViT struggles to transfer its knowledge from visible to infrared images due to inadequate training data. Even the largest available dataset, SYSU-MM01, proves insufficient for ViT to glean a robust representation of infrared images. This predicament is exacerbated when ViT is trained on the smaller RegDB dataset, where slight data flow modifications drastically affect performance—a stark contrast to CNN behavior. These observations lead us to conjecture that the CNN-inspired paradigm impedes ViT’s progress in VI-ReID. In light of these challenges, we undertake comprehensive ablation studies to shed new light on ViT’s applicability in VI-ReID. We propose a straightforward yet effective framework, named “Idformer”, to train a high-performing ViT for VI-ReID. Idformer serves as a robust baseline that can be further enhanced with carefully designed techniques akin to those used for CNNs. Remarkably, our method attains competitive results even in the absence of auxiliary information, achieving 78.58%/76.99% Rank-1/mAP on the SYSU-MM01 dataset, as well as 96.82%/91.83% Rank-1/mAP on the RegDB dataset. The code will be made publicly accessible.
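
The Rank-1 and mAP figures quoted in this abstract are standard retrieval metrics. The sketch below shows one simplified way to compute them from a query-to-gallery distance matrix. It is an assumption-laden illustration, not the official SYSU-MM01 or RegDB evaluation protocol: it ignores camera-based filtering and repeated trials, and the name `rank1_and_map` is hypothetical.

```python
def rank1_and_map(dist, query_ids, gallery_ids):
    """Compute Rank-1 accuracy and mean average precision.

    dist[i][j] is the distance between query i and gallery item j
    (smaller means more similar). Queries without any true match in
    the gallery are skipped.
    """
    hits = 0
    average_precisions = []
    for i, qid in enumerate(query_ids):
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue
        hits += 1 if matches[0] else 0
        found = 0
        precisions = []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        average_precisions.append(sum(precisions) / len(precisions))
    if not average_precisions:
        raise ValueError("no query has a matching gallery item")
    n = len(average_precisions)
    return hits / n, sum(average_precisions) / n


# Toy example with two queries and three gallery items (made-up distances).
dist = [[0.1, 0.9, 0.5],
        [0.8, 0.4, 0.3]]
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```

In the toy example the first query retrieves both of its matches at the top, while the second query finds its only match at rank 2, so Rank-1 is 0.5 and mAP is 0.75.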

Citations: 0

An end-to-end tracking framework via multi-view and temporal feature aggregation

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-10 DOI: 10.1016/j.cviu.2024.104203

Yihan Yang, Ming Xu, Jason F. Ralph, Yuchen Ling, Xiaonan Pan

Multi-view pedestrian tracking has frequently been used to cope with the challenges of occlusion and limited fields-of-view in single-view tracking. However, there are few end-to-end methods in this field. Many existing algorithms detect pedestrians in individual views, cluster projected detections in a top view and then track them. The others track pedestrians in individual views and then associate the projected tracklets in a top view. In this paper, an end-to-end framework is proposed for multi-view tracking, in which both multi-view and temporal aggregations of feature maps are applied. The multi-view aggregation projects the per-view feature maps to a top view, uses a transformer encoder to output encoded feature maps and then uses a CNN to calculate a pedestrian occupancy map. The temporal aggregation uses another CNN to estimate position offsets from the encoded feature maps in consecutive frames. Our experiments have demonstrated that this end-to-end framework outperforms the state-of-the-art online algorithms for multi-view pedestrian tracking.
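
The abstract does not say how per-view feature maps are projected to the top view. A common choice in multi-view detection is a planar homography between each camera image and the ground plane, and the sketch below assumes that choice. It maps a single image point rather than a full feature map, and the function name and matrix values are hypothetical.

```python
def project_to_ground(H, x, y):
    """Map image pixel (x, y) to top-view coordinates with a 3x3 homography H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under H")
    # Divide by the homogeneous coordinate to get Cartesian top-view coordinates.
    return u / w, v / w


# Toy homography (made-up values, not from a calibrated camera).
H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, -1.0],
     [0.0, 0.0, 2.0]]
top_x, top_y = project_to_ground(H, 3.0, 4.0)
```

Warping a whole feature map applies the same mapping to every cell, usually with interpolation, so that features from all cameras land on one shared ground-plane grid before aggregation.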


A fast differential network with adaptive reference sample for gaze estimation

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-09 DOI: 10.1016/j.cviu.2024.104156

Jiahui Hu, Yonghua Lu, Xiyuan Ye, Qiang Feng, Lihua Zhou

Most non-invasive gaze estimation methods do not consider inter-individual differences in anatomical structure and instead regress the gaze direction directly from appearance image information, which limits the accuracy of individual-independent gaze estimation networks. In addition, existing gaze estimation methods tend to consider only how to improve the model’s generalization performance, ignoring the crucial issue of efficiency, which leads to bulky models that are difficult to deploy and have questionable cost-effectiveness in practical use. This paper makes the following contributions: (1) A differential network for gaze estimation using adaptive reference samples is proposed, which can adaptively select reference samples based on scene and individual characteristics. (2) Knowledge distillation is used to transfer the knowledge structure of robust teacher networks into lightweight networks so that the networks can execute quickly and at low computational cost, considerably increasing the practical value of gaze estimation. (3) Integrating the above innovations, a novel fast differential neural network (Diff-Net) named FDAR-Net is constructed, which achieves excellent results on MPIIGaze, UTMultiview and EyeDiap.
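
One common reading of differential gaze estimation is that a network predicts the gaze difference between a target image and a reference image whose gaze is known, and the target gaze is recovered by adding that difference to the reference gaze. The sketch below assumes this reading and averages over several references. It does not implement the paper's adaptive reference selection or its distillation step, and `diff_fn` and `differential_estimate` are hypothetical names.

```python
def differential_estimate(diff_fn, target_img, references):
    """Estimate (yaw, pitch) of target_img from references with known gaze.

    diff_fn(target, reference) returns the predicted gaze difference
    (d_yaw, d_pitch) = gaze(target) - gaze(reference).
    references is a non-empty list of (image, (yaw, pitch)) pairs.
    """
    if not references:
        raise ValueError("at least one reference sample is required")
    estimates = []
    for ref_img, (yaw, pitch) in references:
        d_yaw, d_pitch = diff_fn(target_img, ref_img)
        estimates.append((yaw + d_yaw, pitch + d_pitch))
    n = len(estimates)
    return (sum(e[0] for e in estimates) / n,
            sum(e[1] for e in estimates) / n)


# Toy stand-in for a trained network: each "image" is just its true gaze,
# so the predicted difference is exact.
def toy_diff(target, reference):
    return target[0] - reference[0], target[1] - reference[1]


refs = [((0.0, 0.0), (0.0, 0.0)), ((1.0, 2.0), (1.0, 2.0))]
gaze = differential_estimate(toy_diff, (0.5, -0.5), refs)
```

With a real network the per-reference estimates would disagree, which is why the choice of reference samples matters and why the abstract emphasizes selecting them adaptively.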

Citations: 0

A semantic segmentation method integrated convolutional nonlinear spiking neural model with Transformer

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-09 DOI: 10.1016/j.cviu.2024.104196

Siyan Sun, Wenqian Yang, Hong Peng, Jun Wang, Zhicai Liu

Semantic segmentation is a critical task in computer vision, with significant applications in areas like autonomous driving and medical imaging. Transformer-based methods have gained considerable attention recently because of their strength in capturing global information. However, these methods often sacrifice detailed information due to the lack of mechanisms for local interactions. Similarly, convolutional neural network (CNN) methods struggle to capture global context due to the inherent limitations of convolutional kernels. To overcome these challenges, this paper introduces a novel Transformer-based semantic segmentation method called NSNPFormer, which leverages the nonlinear spiking neural P (NSNP) system—a computational model inspired by the spiking mechanisms of biological neurons. The NSNPFormer employs an encoding–decoding structure with two convolutional NSNP components and a residual connection channel. The convolutional NSNP components facilitate nonlinear local feature extraction and block-level feature fusion. Meanwhile, the residual connection channel helps prevent the loss of feature information during the decoding process. Evaluations on the ADE20K and Pascal Context datasets show that NSNPFormer achieves mIoU scores of 53.7 and 58.06, respectively, highlighting its effectiveness in semantic segmentation tasks.
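
The mIoU scores reported in this abstract average the per-class intersection-over-union. The sketch below shows the basic computation on flattened label lists. It is a simplified illustration: benchmark toolkits typically accumulate intersections and unions over the whole dataset and report the result as a percentage, and the name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred and target are equal-length sequences of integer class labels,
    one entry per pixel.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy 4-pixel example: class 0 IoU is 1/2, class 1 IoU is 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

A single mislabeled pixel lowers the IoU of both classes involved, which is why mIoU penalizes boundary errors more than plain pixel accuracy does.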

Citations: 0

MT-DSNet: Mix-mask teacher–student strategies and dual dynamic selection plug-in module for fine-grained image recognition

IF 4.3 | Zone 3, Computer Science | Q2 | COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-08 DOI: 10.1016/j.cviu.2024.104201

Hongchun Lu, Min Han

The fine-grained image recognition (FGIR) task aims to distinguish subcategories with visually similar appearances, such as bird species or the makes and models of vehicles. However, subtle interclass differences and large intraclass variances lead to poor recognition performance. To address these challenges, we developed a mixed-mask teacher–student cooperative training strategy. A mixed masked image is generated by replacing one image’s visible marker with another’s masked marker and is then embedded into a knowledge distillation network. Collaborative reinforcement between the teacher and the student improves the recognition performance of the network. We chose the classic transformer architecture as a baseline to better explore the contextual relationships between features. Additionally, we propose a dual dynamic selection plug-in that selects discriminative features in the spatial and channel dimensions and filters out irrelevant interference, so that background and noise features in fine-grained images are handled efficiently. The proposed feature suppression module enhances the differences between features, thereby motivating the network to mine more discriminative features. We validated our method on two datasets: CUB-200-2011 and Stanford Cars. The experimental results show that the proposed MT-DSNet can significantly improve the feature representation for FGIR tasks. Moreover, when applied to different fine-grained networks, it improves FGIR accuracy without changing the original network structure. We hope that this work provides a promising approach for improving the feature representation of networks in the future.
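
The mixing step described above can be pictured with a small sketch. It rests on several assumptions that the abstract does not confirm: "marker" is read as a patch token, both images are split into the same number of tokens, masked positions are drawn uniformly at random, and the mixed sequence keeps the first image's tokens at visible positions while taking the second image's tokens at masked positions. The function name `mix_mask_tokens` and its parameters are hypothetical.

```python
import random


def mix_mask_tokens(tokens_a, tokens_b, mask_ratio=0.5, seed=0):
    """Build a mixed token sequence from two images (illustrative only).

    Positions chosen as masked take their token from image B; all other
    positions keep the token from image A. Returns the mixed sequence and
    a boolean list marking which positions came from image B.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both images must have the same number of tokens")
    n = len(tokens_a)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    num_masked = int(n * mask_ratio)
    masked = set(rng.sample(range(n), num_masked))
    mask = [i in masked for i in range(n)]
    mixed = [tokens_b[i] if mask[i] else tokens_a[i] for i in range(n)]
    return mixed, mask


# Toy usage: strings stand in for patch embeddings.
tokens_a = [f"a{i}" for i in range(8)]
tokens_b = [f"b{i}" for i in range(8)]
mixed, mask = mix_mask_tokens(tokens_a, tokens_b, mask_ratio=0.5)
print(mixed)
```

How the teacher and student networks consume such a mixed input is not specified in the abstract, so it is left out of the sketch.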

Citations: 0

Hyperspectral image classification with token fusion on GPU

IF 4.3 | Zone 3, Computer Science | Q2 | COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding

Pub Date : 2024-10-05 DOI: 10.1016/j.cviu.2024.104198

He Huang, Sha Tao

Hyperspectral images capture fine material differences through spectral data, which is vital for remote sensing. Transformers have become a mainstream approach for handling high-dimensional hyperspectral data with complex structures. However, a major challenge they face when processing hyperspectral images is the large number of redundant tokens, which significantly increases the computational load and slows inference. Therefore, we propose a token fusion algorithm tailored to the operational characteristics of hyperspectral images and pure transformer networks, aimed at enhancing the final accuracy and throughput of the model. The token fusion algorithm introduces a token merging step between the attention mechanism and the multi-layer perceptron module in each Transformer layer. Experiments on four hyperspectral image datasets demonstrate that our token fusion algorithm can significantly improve inference speed without any training, while causing only a slight decrease in the pure transformer network’s classification accuracy.
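
To make the idea of a training-free token merging step more concrete, here is a minimal sketch of one way such a step could look. The abstract does not state the merging rule, so everything here is an assumption: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second set by cosine similarity, and the `r` best-matched pairs are fused by averaging. The names `cosine` and `merge_tokens` are hypothetical, and the paper's GPU implementation is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_tokens(tokens, r):
    """Reduce the token count by up to r via similarity-based averaging.

    Illustrative sketch: no learned parameters are involved, so it could be
    applied to a pretrained model between attention and the MLP block.
    Note that the output order is (unmerged source tokens, then destination
    tokens), not the original order.
    """
    src = tokens[0::2]  # candidates to be merged away
    dst = tokens[1::2]  # tokens that absorb their matches
    if not dst:
        return [list(t) for t in tokens]
    r = min(r, len(src))
    # For every source token, find its most similar destination token.
    matches = []
    for i, s in enumerate(src):
        j = max(range(len(dst)), key=lambda k: cosine(s, dst[k]))
        matches.append((cosine(s, dst[j]), i, j))
    matches.sort(reverse=True)  # most similar pairs first
    merged = {i: j for _, i, j in matches[:r]}
    # Average each destination token with the source tokens assigned to it.
    sums = [list(t) for t in dst]
    counts = [1] * len(dst)
    for i, j in merged.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    kept = [list(src[i]) for i in range(len(src)) if i not in merged]
    fused = [[x / c for x in s] for s, c in zip(sums, counts)]
    return kept + fused


# Toy usage: four 2-D tokens reduced to three.
tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.1]]
print(merge_tokens(tokens, r=1))
```

Because fewer tokens reach the MLP block and all later layers, a step like this trades a small amount of information for lower compute, which is consistent with the speed and accuracy trade-off the abstract reports.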

Citations: 0