International Journal of Computer Vision: Latest Publications

2D Semantic-Guided Semantic Scene Completion
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-03 · DOI: 10.1007/s11263-024-02244-y
Xianzhu Liu, Haozhe Xie, Shengping Zhang, Hongxun Yao, Rongrong Ji, Liqiang Nie, Dacheng Tao

Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions with multiple objects close to each other, especially for objects with reflective or dark surfaces. This primarily stems from two challenges: (1) the loss of geometric information due to the unreliability of depth values from sensors, and (2) the potential for semantic confusion when simultaneously predicting 3D shapes and semantic labels. To address these problems, we propose a Semantic-guided Semantic Scene Completion framework, dubbed SG-SSC, which involves Semantic-guided Fusion (SGF) and a Volume-guided Semantic Predictor (VGSP). Guided by 2D semantic segmentation maps, SGF adaptively fuses RGB and depth features to compensate for the geometric information lost to missing values in depth images, making the framework more robust to unreliable depth information. VGSP exploits the mutual benefit between the SC and SSC tasks, making SSC more focused on predicting the categories of voxels with high occupancy probabilities and allowing SC to utilize semantic priors to better predict voxel occupancy. Experimental results show that SG-SSC outperforms existing state-of-the-art methods on the NYU, NYUCAD, and SemanticKITTI datasets. Models and code are available at https://github.com/aipixel/SG-SSC.
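
To make the fusion idea concrete, the following is a minimal PyTorch sketch of a semantic-guided fusion block of this general kind: a 2D semantic map produces a per-pixel gate that decides how much to trust the depth feature versus the RGB feature. The gate design, layer sizes, and class count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticGuidedFusion(nn.Module):
    """Illustrative fusion block: a 2D semantic map gates how much each
    pixel trusts the depth feature versus the RGB feature."""
    def __init__(self, feat_dim=64, num_classes=12):
        super().__init__()
        # Hypothetical gate network: semantic logits -> per-pixel weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(num_classes, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat, sem_logits):
        # rgb_feat, depth_feat: (B, C, H, W); sem_logits: (B, num_classes, H, W)
        w = self.gate(sem_logits)                     # (B, 1, H, W)
        # Where depth is unreliable the gate can down-weight it and lean on RGB.
        return w * depth_feat + (1.0 - w) * rgb_feat

# Toy usage
fusion = SemanticGuidedFusion(feat_dim=64, num_classes=12)
rgb = torch.randn(2, 64, 60, 80)
depth = torch.randn(2, 64, 60, 80)
sem = torch.randn(2, 12, 60, 80)
fused = fusion(rgb, depth, sem)   # (2, 64, 60, 80)
```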

Citations: 0
From Gaze Jitter to Domain Adaptation: Generalizing Gaze Estimation by Manipulating High-Frequency Components
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-30 · DOI: 10.1007/s11263-024-02233-1
Ruicong Liu, Haofei Wang, Feng Lu

Gaze, as a pivotal indicator of human emotion, plays a crucial role in various computer vision tasks. However, the accuracy of gaze estimation often deteriorates significantly when applied to unseen environments, thereby limiting its practical value. Therefore, enhancing the generalizability of gaze estimators to new domains emerges as a critical challenge. A common limitation in existing domain adaptation research is the inability to identify and leverage truly influential factors during the adaptation process. This shortcoming often results in issues such as limited accuracy and unstable adaptation. To address this issue, this article identifies a truly influential factor in the cross-domain problem, i.e., high-frequency components (HFC). This discovery stems from an analysis of gaze jitter, a frequently overlooked but impactful issue where predictions can deviate drastically even for visually similar input images. Inspired by this discovery, we propose an "embed-then-suppress" HFC manipulation strategy to adapt gaze estimation to new domains. Our method first embeds additive HFC into the input images, then performs domain adaptation by suppressing the impact of HFC. Specifically, the suppression is carried out in a contrastive manner: each original image is paired with its HFC-embedded version, thereby enabling our method to suppress the HFC impact by contrasting the representations within each pair. The proposed method is evaluated across four cross-domain gaze estimation tasks. The experimental results show that it not only enhances gaze estimation accuracy but also significantly reduces gaze jitter in the target domain. Compared with previous studies, our method offers higher accuracy, reduced gaze jitter, and improved adaptation stability, marking its potential for practical deployment.
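
As a rough illustration of the embed-then-suppress idea, the sketch below extracts high-frequency components with an FFT high-pass filter, adds them to the input, and then pulls the clean and HFC-embedded representations together. The cutoff radius, the additive strength `alpha`, and the cosine-similarity objective are assumptions standing in for the paper's actual contrastive formulation.

```python
import torch
import torch.nn.functional as F

def high_frequency_components(img, cutoff=0.25):
    """Keep only spectral components beyond `cutoff` of the normalized radius.
    img: (B, C, H, W) float tensor."""
    _, _, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H, device=img.device),
        torch.linspace(-1, 1, W, device=img.device),
        indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() > cutoff).to(img.dtype)   # high-pass mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

def embed_then_suppress_loss(encoder, img, alpha=0.5):
    """Embed additive HFC into the image, then pull the two representations
    together so the encoder becomes insensitive to HFC (a cosine stand-in
    for the paper's contrastive objective)."""
    img_hfc = img + alpha * high_frequency_components(img)
    z_clean, z_hfc = encoder(img), encoder(img_hfc)
    return 1.0 - F.cosine_similarity(z_clean, z_hfc, dim=-1).mean()
```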

Citations: 0
LEO: Generative Latent Image Animator for Human Video Synthesis
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-27 · DOI: 10.1007/s11263-024-02231-3
Yaohui Wang, Xin Ma, Xinyuan Chen, Cunjian Chen, Antitza Dantcheva, Bo Dai, Yu Qiao

Spatio-temporal coherency is a major challenge in synthesizing high-quality videos, particularly human videos that contain rich global and local deformations. To resolve this challenge, previous approaches have resorted to different features in the generation process aimed at representing appearance and motion. However, in the absence of strict mechanisms to guarantee such disentanglement, separating motion from appearance has remained challenging, resulting in spatial distortions and temporal jittering that break the spatio-temporal coherency. Motivated by this, we propose LEO, a novel framework for human video synthesis that places emphasis on spatio-temporal coherency. Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolates motion from appearance. We implement this idea via a flow-based image animator and a Latent Motion Diffusion Model (LMDM). The former bridges a space of motion codes with the space of flow maps and synthesizes video frames in a warp-and-inpaint manner. LMDM learns to capture the motion prior in the training data by synthesizing sequences of motion codes. Extensive quantitative and qualitative analysis suggests that LEO significantly improves coherent synthesis of human videos over previous methods on the TaichiHD, FaceForensics, and CelebV-HQ datasets. In addition, the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis and content-preserving video editing. Project page: https://wyhsirius.github.io/LEO-project/.
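
The "warp" half of the warp-and-inpaint step can be illustrated with a standard flow-based warp using `grid_sample`; the snippet below is a generic sketch under that assumption and does not reproduce LEO's animator or the inpainting stage.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(frame, flow):
    """Warp a starting frame with a dense flow map (the 'warp' half of
    warp-and-inpaint). frame: (B, C, H, W); flow: (B, 2, H, W) in pixels."""
    B, _, H, W = frame.shape
    # Base sampling grid in pixel coordinates.
    yy, xx = torch.meshgrid(
        torch.arange(H, device=frame.device, dtype=frame.dtype),
        torch.arange(W, device=frame.device, dtype=frame.dtype),
        indexing="ij")
    grid_x = xx.unsqueeze(0) + flow[:, 0]   # displaced x coordinates
    grid_y = yy.unsqueeze(0) + flow[:, 1]   # displaced y coordinates
    # Normalise to [-1, 1] as required by grid_sample, (x, y) order.
    grid = torch.stack(
        (2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow reproduces the input frame
assert torch.allclose(warp_by_flow(frame, flow), frame, atol=1e-5)
```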

Citations: 0
Slimmable Networks for Contrastive Self-supervised Learning
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-26 · DOI: 10.1007/s11263-024-02211-7
Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang

Self-supervised learning makes significant progress in pre-training large models but struggles with small models. Mainstream solutions to this problem rely mainly on knowledge distillation, which involves a two-stage procedure: first training a large teacher model and then distilling it to improve the generalization ability of smaller ones. In this work, we introduce a one-stage solution to obtain pre-trained small models without the need for extra teachers, namely, slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks, including small ones with low computation costs. However, interference between weight-sharing networks leads to severe performance degradation in self-supervised cases, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former indicates that a small proportion of parameters produce dominant gradients during backpropagation, while the main parameters may not be fully optimized. The latter shows that the gradient direction is disordered and the optimization process is unstable. To address these issues, we introduce three techniques to make the main parameters produce dominant gradients and to give sub-networks consistent outputs: slow-start training of sub-networks, online distillation, and loss re-weighting according to model sizes. Furthermore, theoretical results demonstrate that a single slimmable linear layer is sub-optimal during linear evaluation; thus, a switchable linear probe layer is applied during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than prior art with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.
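
A hedged sketch of how online distillation and size-based loss re-weighting could be wired together is shown below. `model.set_width`, `contrastive_loss`, and the width list are hypothetical hooks standing in for SlimCLR's actual slimmable backbone and objective.

```python
import torch.nn.functional as F

def slimmable_train_step(model, contrastive_loss, views,
                         widths=(1.0, 0.75, 0.5, 0.25)):
    """One illustrative step combining online distillation with size-based
    loss re-weighting. `model.set_width(w)` is assumed to switch the shared
    weights to the sub-network of relative width w."""
    model.set_width(widths[0])                     # full network acts as teacher
    teacher_logits = model(views)
    total = contrastive_loss(teacher_logits)
    teacher = teacher_logits.detach().softmax(dim=-1)
    for w in widths[1:]:
        model.set_width(w)                         # weight-sharing sub-network
        student_logits = model(views)
        kd = F.kl_div(student_logits.log_softmax(dim=-1), teacher,
                      reduction="batchmean")
        # Re-weight by width so small sub-networks do not dominate the
        # gradients flowing into the shared parameters.
        total = total + w * (contrastive_loss(student_logits) + kd)
    return total
```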

Citations: 0
Mutual Prompt Learning for Vision Language Models
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-26 · DOI: 10.1007/s11263-024-02243-z
Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, Jiangjiang Liu, Jingyuan Feng, Shengsheng Wang, Jingdong Wang

Large pre-trained vision language models (VLMs) have demonstrated impressive representation learning capabilities, but their transferability across various downstream tasks heavily relies on prompt learning. Since VLMs consist of text and visual sub-branches, existing prompt approaches are mainly divided into text and visual prompts. Recent text prompt methods have achieved great performance by designing input-condition prompts that encompass both text and image domain knowledge. However, roughly incorporating the same image feature into each learnable text token may be unjustifiable, as it could result in learnable text prompts being concentrated on one or a subset of characteristics. In light of this, we propose a fine-grained text prompt (FTP) that decomposes the single global image features into several finer-grained semantics and incorporates them into corresponding text prompt tokens. On the other hand, current methods neglect valuable text semantic information when building the visual prompt. Furthermore, text information contains redundant and negative category semantics. To address this, we propose a text-reorganized visual prompt (TVP) that reorganizes the text descriptions of the current image to construct the visual prompt, guiding the image branch to attend to class-related representations. By leveraging both FTP and TVP, we enable mutual prompting between the text and visual modalities, unleashing their potential to tap into the representation capabilities of VLMs. Extensive experiments on 11 classification benchmarks show that our method surpasses existing methods by a large margin. In particular, our approach improves recent state-of-the-art CoCoOp by 4.79% on new classes and 3.88% on harmonic mean over eleven classification benchmarks.
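
To illustrate the flavour of FTP, the sketch below uses K learnable queries that cross-attend to image patch features and adds each resulting fine-grained semantic to its own learnable text prompt token. The dimensions, number of tokens, and single-head attention are assumptions for illustration only, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FineGrainedTextPrompt(nn.Module):
    """Illustrative FTP-style module: K learnable queries cross-attend to the
    image features, producing K finer-grained semantics that condition K
    separate text prompt tokens."""
    def __init__(self, dim=512, num_tokens=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.text_prompts = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) patch features from the image encoder.
        B = image_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)           # (B, K, dim)
        fine_grained, _ = self.attn(q, image_tokens, image_tokens)
        # Each text prompt token receives its own image-conditioned semantic
        # instead of the same global image feature.
        return self.text_prompts.unsqueeze(0) + fine_grained      # (B, K, dim)

ftp = FineGrainedTextPrompt(dim=512, num_tokens=4)
prompts = ftp(torch.randn(8, 196, 512))   # (8, 4, 512) conditioned prompt tokens
```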

Citations: 0
Robust Deep Object Tracking against Adversarial Attacks
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-26 · DOI: 10.1007/s11263-024-02226-0
Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang, Ming-Hsuan Yang

Addressing the vulnerability of deep neural networks (DNNs) has attracted significant attention in recent years. While recent studies on adversarial attack and defense mainly consider single images, few efforts have been made to perform temporal attacks against video sequences. Because the temporal consistency between frames is not considered, existing adversarial attack approaches designed for static images do not perform well for deep object tracking. In this work, we generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks under white-box and black-box settings. To this end, we consider motion signals when generating lightweight perturbations over the estimated tracking results frame by frame. For the white-box attack, we generate temporal perturbations via known trackers to significantly degrade the tracking performance. For the black-box attack, we transfer the generated perturbations to unknown target trackers to achieve transfer-based attacks. Furthermore, we train universal adversarial perturbations and directly add them to all frames of videos, improving the attack effectiveness with minor computational costs. On the other hand, we sequentially learn to estimate and remove the perturbations from input sequences to restore the tracking performance. We apply the proposed adversarial attack and defense approaches to state-of-the-art tracking algorithms. Extensive evaluations on large-scale benchmark datasets, including OTB, VOT, UAV123, and LaSOT, demonstrate that our attack method degrades the tracking performance significantly with favorable transferability to other backbones and trackers. Notably, the proposed defense method restores the original tracking performance to some extent and achieves additional performance gains when not under adversarial attacks.
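
A minimal sketch of a frame-by-frame white-box attack that keeps perturbations temporally consistent is given below: each frame's perturbation is warm-started from the previous frame's and refined with a few signed gradient steps. The `tracker_loss` callable and the step sizes are hypothetical; the paper's exact attack objective is not reproduced.

```python
import torch

def temporal_attack(tracker_loss, frames, epsilon=8 / 255, step=2 / 255, iters=3):
    """Illustrative white-box temporal attack. `tracker_loss(frame)` is a
    hypothetical callable returning a scalar whose increase degrades the
    tracker's prediction for that frame."""
    delta = torch.zeros_like(frames[0])
    adv_frames = []
    for frame in frames:
        delta = delta.detach()                    # reuse the previous perturbation
        for _ in range(iters):
            delta.requires_grad_(True)
            loss = tracker_loss(frame + delta)
            grad, = torch.autograd.grad(loss, delta)
            # Signed gradient ascent, clipped to the epsilon ball.
            delta = (delta + step * grad.sign()).clamp(-epsilon, epsilon).detach()
        adv_frames.append((frame + delta).clamp(0, 1))
    return adv_frames
```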

Citations: 0
Breaking the Limits of Reliable Prediction via Generated Data
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-20 · DOI: 10.1007/s11263-024-02221-5
Zhen Cheng, Fei Zhu, Xu-Yao Zhang, Cheng-Lin Liu

In open-world recognition of safety-critical applications, providing reliable prediction for deep neural networks has become a critical requirement. Many methods have been proposed for reliable prediction related tasks such as confidence calibration, misclassification detection, and out-of-distribution detection. Recently, pre-training has been shown to be one of the most effective methods for improving reliable prediction, particularly for modern networks like ViT, which require a large amount of training data. However, collecting data manually is time-consuming. In this paper, taking advantage of the breakthrough of generative models, we investigate whether and how expanding the training set using generated data can improve reliable prediction. Our experiments reveal that training with a large quantity of generated data can eliminate overfitting in reliable prediction, leading to significantly improved performance. Surprisingly, classical networks like ResNet-18, when trained on a notably extensive volume of generated data, can sometimes exhibit performance competitive to pre-training ViT with a substantial real dataset.
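
In practice, expanding the training set with generated samples can be as simple as concatenating a real dataset with a much larger generated one, as in the toy sketch below; the random tensors are stand-ins for images that would actually come from a generative model.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Hypothetical stand-ins: a small real set and a much larger generated set.
real_set = TensorDataset(torch.rand(1_000, 3, 32, 32),
                         torch.randint(0, 10, (1_000,)))
generated_set = TensorDataset(torch.rand(50_000, 3, 32, 32),
                              torch.randint(0, 10, (50_000,)))

# Expanded training set: real data plus generated data drawn together.
train_set = ConcatDataset([real_set, generated_set])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)
```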

Citations: 0
Lidar Panoptic Segmentation in an Open World
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-19 · DOI: 10.1007/s11263-024-02166-9
Anirudh S. Chakravarthy, Meghana Reddy Ganesina, Peiyun Hu, Laura Leal-Taixé, Shu Kong, Deva Ramanan, Aljosa Osep

Addressing Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make the unrealistic assumption that the semantic class vocabulary is fixed in the real open world; in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to a Region Proposal Network (Ren et al. NeurIPS, 2015). We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.
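
The cut in the weighted hierarchical tree of point segments can be illustrated with a small recursion: for each node, either keep it as a single segment or take the best cuts of its children, whichever scores higher. The node scores and the maximisation objective below are simplifying assumptions, not the paper's exact formulation.

```python
class SegmentNode:
    """Node in a hierarchical over-segmentation tree of lidar point segments."""
    def __init__(self, score, children=()):
        self.score = score          # e.g., objectness from a binary classifier
        self.children = list(children)

def best_cut(node):
    """Return (score, segments) for the best cut under `node`: either keep the
    node as one segment, or recurse and combine the children's best cuts."""
    if not node.children:
        return node.score, [node]
    child_score, child_segments = 0.0, []
    for child in node.children:
        s, segs = best_cut(child)
        child_score += s
        child_segments += segs
    if node.score >= child_score:
        return node.score, [node]
    return child_score, child_segments

# Toy tree: the root wrongly groups two objects that should be split.
leaf_a, leaf_b = SegmentNode(0.9), SegmentNode(0.8)
root = SegmentNode(0.6, children=[leaf_a, leaf_b])
score, segments = best_cut(root)     # -> 1.7, [leaf_a, leaf_b]
```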

Citations: 0
FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-19 · DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han

Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in the multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×–2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).
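
Delayed subject conditioning can be sketched as a switch inside the sampling loop: the first fraction of denoising steps uses text-only conditioning, and the remainder uses subject-augmented conditioning. The `denoise_step` callable and the switch ratio below are hypothetical stand-ins, not FastComposer's real API.

```python
def sample_with_delayed_subject_conditioning(
        denoise_step, latents, text_cond, subject_cond, timesteps,
        switch_ratio=0.2):
    """Illustrative delayed subject conditioning: early steps use text-only
    conditioning to lay out the scene and keep editability, later steps use
    the subject-augmented conditioning to lock in identity.
    `denoise_step(latents, t, cond)` is a hypothetical one-step denoiser."""
    switch_at = int(len(timesteps) * switch_ratio)
    for i, t in enumerate(timesteps):
        cond = text_cond if i < switch_at else subject_cond
        latents = denoise_step(latents, t, cond)
    return latents
```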

Citations: 0
Hierarchical Active Learning for Low-Altitude Drone-View Object Detection
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-09-15 · DOI: 10.1007/s11263-024-02228-y
Haohao Hu, Tianyu Han, Yuerong Wang, Wanjun Zhong, Jingwei Yue, Peng Zan

Various object detection techniques are employed on drone platforms. However, the task of annotating drone-view samples is both time-consuming and laborious. This is primarily due to the presence of numerous small-sized instances to be labeled in the drone-view image. To tackle this issue, we propose HALD, a hierarchical active learning approach for low-altitude drone-view object detection. HALD extracts unlabeled image information sequentially from different levels, including point, box, image, and class, aiming to obtain a reliable indicator of image information. The point-level module is utilized to ascertain the valid count and location of instances, while the box-level module screens out reliable predictions. The image-level module selects candidate samples by calculating the consistency of valid boxes within an image, and the class-level module selects the final selected samples based on the distribution of candidate and labeled samples across different classes. Extensive experiments conducted on the VisDrone and CityPersons datasets demonstrate that HALD outperforms several other baseline methods. Additionally, we provide an in-depth analysis of each proposed module. The results show that the performance of evaluating the informativeness of samples can be effectively improved by the four hierarchical levels.
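
As a rough illustration of the image-level selection step, the sketch below scores an image by how consistent its predicted boxes are across two stochastic forward passes and sends the least consistent images to annotation; the actual consistency measure and thresholds used by HALD are not reproduced here.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def image_consistency(preds_a, preds_b):
    """Toy image-level score: match boxes predicted under two stochastic
    passes (e.g., different augmentations) and average the best IoUs."""
    if not preds_a or not preds_b:
        return 0.0
    return float(np.mean([max(box_iou(a, b) for b in preds_b) for a in preds_a]))

def select_candidates(per_image_preds, budget):
    """Pick the `budget` images whose predictions are least consistent,
    i.e., the most informative ones to send to annotation."""
    scores = {img: image_consistency(*passes)
              for img, passes in per_image_preds.items()}
    return sorted(scores, key=scores.get)[:budget]
```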

Citations: 0