UAHOI: Uncertainty-aware robust interaction learning for HOI detection

Computer Vision and Image Understanding (IF 4.3; Zone 3, Computer Science; JCR Q2, Computer Science, Artificial Intelligence). Pub Date: 2024-07-20. DOI: 10.1016/j.cviu.2024.104091


Abstract

This paper focuses on Human–Object Interaction (HOI) detection, addressing the challenge of identifying and understanding the interactions between humans and objects within a given image or video frame. Spearheaded by the Detection Transformer (DETR), recent developments have led to significant improvements by replacing traditional region proposals with a set of learnable queries. However, despite the powerful representation capabilities provided by Transformers, existing HOI detection methods still yield low confidence levels when dealing with complex interactions and are prone to overlooking interactive actions. To address these issues, we propose a novel approach, UAHOI (Uncertainty-aware Robust Human–Object Interaction Learning), that explicitly estimates prediction uncertainty during the training process to refine both detection and interaction predictions. Our model not only predicts the HOI triplets but also quantifies the uncertainty of these predictions. Specifically, we model this uncertainty through the variance of predictions and incorporate it into the optimization objective, allowing the model to adaptively adjust its confidence threshold based on prediction variance. Serving as an automatic confidence threshold that requires no hand-designed components, this integration helps mitigate the adverse effects of incorrect or ambiguous predictions that are common in traditional methods. Our method can be flexibly applied to existing HOI detection methods and demonstrates improved accuracy. We evaluate UAHOI on two standard benchmarks in the field, V-COCO and HICO-DET, which represent challenging scenarios for HOI detection. Through extensive experiments, we demonstrate that UAHOI achieves significant improvements over existing state-of-the-art methods, enhancing both the accuracy and robustness of HOI detection.
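The abstract does not give the exact form of the uncertainty-aware objective, so the following is only a minimal sketch of one common way to fold a predicted variance into a loss: each per-prediction loss is scaled by the inverse of a predicted variance, and the log-variance is added as a penalty so the model cannot declare everything uncertain. This heteroscedastic weighting is an assumed stand-in, not the formulation used in UAHOI, and the function name `uncertainty_weighted_loss` and its parameters are hypothetical.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Average of exp(-s_i) * loss_i + s_i over all predictions.

    losses   -- per-prediction task losses (e.g. interaction classification)
    log_vars -- per-prediction log-variance s_i = log(sigma_i^2), which a
                model would predict alongside each HOI triplet

    Assumption: this attenuation form is illustrative only; the paper's
    actual objective is not stated in the abstract.
    """
    if len(losses) != len(log_vars):
        raise ValueError("losses and log_vars must have the same length")
    if not losses:
        raise ValueError("at least one prediction is required")

    total = 0.0
    for loss, s in zip(losses, log_vars):
        # High predicted variance shrinks the loss term, while the additive
        # s term penalises claiming high variance everywhere.
        total += math.exp(-s) * loss + s
    return total / len(losses)


if __name__ == "__main__":
    confident = uncertainty_weighted_loss([2.0], [0.0])
    uncertain = uncertainty_weighted_loss([2.0], [math.log(2.0)])
    print(confident, uncertain)
```

Under this sketch, a hard example with loss 2.0 contributes 2.0 when its log-variance is 0, but only 1 + ln 2 (about 1.69) when the predicted variance is 2, which is how ambiguous predictions would be down-weighted during training.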


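Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days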

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems