Feature Matching via Graph Clustering with Local Affine Consensus
Pub Date: 2024-11-15 | DOI: 10.1007/s11263-024-02291-5
Yifan Lu, Jiayi Ma
This paper studies graph clustering with application to feature matching and proposes an effective method, termed GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges, so that a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometrically meaningful graph, we adopt a local affine strategy based on best practices. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachability on the graph via replicator dynamics optimization. Extensive experiments on both the individual components and the complete GC-LAC pipeline, across various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multi-model fitting, demonstrate that GC-LAC is more competitive than current state-of-the-art methods in terms of generality, efficiency, and effectiveness. The source code is publicly available at https://github.com/YifanLu2000/GCLAC.
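To make the graph-clustering view concrete, the toy sketch below treats each putative match as a node, connects matches whose keypoints are spatially close and whose displacements are coherent, and reads motion patterns off the resulting graph. It only illustrates the general idea under assumed thresholds; it is not the paper's MCDG solver or D2SCAN algorithm.

```python
# Illustrative sketch only: putative matches as graph nodes, edges for spatially
# close and motion-coherent pairs, motion patterns as connected components.
# Thresholds and the grouping rule are assumptions, not the paper's method.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def motion_coherence_graph(pts1, pts2, radius=50.0, flow_tol=10.0):
    """pts1, pts2: (N, 2) matched keypoint coordinates in the two images."""
    flow = pts2 - pts1                                            # displacement of each match
    d_pos = np.linalg.norm(pts1[:, None] - pts1[None], axis=-1)   # spatial proximity
    d_flow = np.linalg.norm(flow[:, None] - flow[None], axis=-1)  # motion similarity
    adj = (d_pos < radius) & (d_flow < flow_tol)                  # edge if nearby AND coherent
    np.fill_diagonal(adj, False)
    return csr_matrix(adj)

def cluster_matches(pts1, pts2, min_size=8):
    graph = motion_coherence_graph(pts1, pts2)
    n_comp, labels = connected_components(graph, directed=False)
    # keep only sufficiently large patterns; the rest are treated as outliers (-1)
    sizes = np.bincount(labels, minlength=n_comp)
    return np.where(sizes[labels] >= min_size, labels, -1)
```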
{"title":"Feature Matching via Graph Clustering with Local Affine Consensus","authors":"Yifan Lu, Jiayi Ma","doi":"10.1007/s11263-024-02291-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02291-5","url":null,"abstract":"<p>This paper studies graph clustering with application to feature matching and proposes an effective method, termed as GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges where a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometric meaningful graph, based on the best practices, we adopt a local affine strategy. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachable on the graph by replicator dynamics optimization. Extensive experiments focusing on both the local and the whole of our GC-LAC with various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multimodel fitting, demonstrate that our GC-LAC is more competitive than current state-of-the-art methods, in terms of generality, efficiency, and effectiveness. The source code for this work is publicly available at: https://github.com/YifanLu2000/GCLAC.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"75 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Detect Novel Species with SAM in the Wild
Pub Date: 2024-11-13 | DOI: 10.1007/s11263-024-02234-0
Garvita Allabadi, Ana Lucic, Yu-Xiong Wang, Vikram Adve
This paper tackles the limitation of a closed-world object detection model that was trained on one species. The expectation for this model is that it will not generalize well to recognize the instances of new species if they were present in the incoming data stream. We propose a novel object detection framework for this open-world setting that is suitable for applications that monitor wildlife, ocean life, livestock, plant phenotype and crops that typically feature one species in the image. Our method leverages labeled samples from one species in combination with a novelty detection method and Segment Anything Model, a vision foundation model, to (1) identify the presence of new species in unlabeled images, (2) localize their instances, and (3) retrain the initial model with the localized novel class instances. The resulting integrated system assimilates and learns from unlabeled samples of the new classes while not “forgetting” the original species the model was trained on. We demonstrate our findings on two different domains, (1) wildlife detection and (2) plant detection. Our method achieves an AP of 56.2 (for 4 novel species) to 61.6 (for 1 novel species) for wildlife domain, without relying on any ground truth data in the background.
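A hedged sketch of the open-world expansion loop described above follows. The callables passed in (`novelty_score`, `localize_with_sam`, `retrain_detector`) are hypothetical placeholders standing in for the novelty detector, the SAM-based localization, and the retraining step; they are not the authors' implementation or the actual SAM API.

```python
# Hedged structural sketch of the three-step loop: detect novelty, localize with a
# class-agnostic segmenter, retrain. All helper callables are hypothetical placeholders.
def expand_detector(detector, unlabeled_images, novelty_score,
                    localize_with_sam, retrain_detector, tau=0.5):
    pseudo_annotations = []
    for image in unlabeled_images:
        detections = detector(image)                    # closed-world predictions
        if novelty_score(image, detections) < tau:      # (1) no new species suspected
            continue
        regions = localize_with_sam(image, detections)  # (2) class-agnostic instance masks/boxes
        pseudo_annotations.append((image, regions))     # pseudo-labels for the novel class
    # (3) retrain with old + new data so the original species are not "forgotten"
    return retrain_detector(detector, pseudo_annotations)
```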
{"title":"Learning to Detect Novel Species with SAM in the Wild","authors":"Garvita Allabadi, Ana Lucic, Yu-Xiong Wang, Vikram Adve","doi":"10.1007/s11263-024-02234-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02234-0","url":null,"abstract":"<p>This paper tackles the limitation of a closed-world object detection model that was trained on one species. The expectation for this model is that it will not generalize well to recognize the instances of new species if they were present in the incoming data stream. We propose a novel object detection framework for this open-world setting that is suitable for applications that monitor wildlife, ocean life, livestock, plant phenotype and crops that typically feature one species in the image. Our method leverages labeled samples from one species in combination with a novelty detection method and Segment Anything Model, a vision foundation model, to (1) identify the presence of new species in unlabeled images, (2) localize their instances, and (3) <i>retrain</i> the initial model with the localized novel class instances. The resulting integrated system <i>assimilates</i> and <i>learns</i> from unlabeled samples of the new classes while not “forgetting” the original species the model was trained on. We demonstrate our findings on two different domains, (1) wildlife detection and (2) plant detection. Our method achieves an AP of 56.2 (for 4 novel species) to 61.6 (for 1 novel species) for wildlife domain, without relying on any ground truth data in the background.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"80 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142610210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MVTN: Learning Multi-view Transformations for 3D Understanding
Pub Date: 2024-11-11 | DOI: 10.1007/s11263-024-02283-5
Abdullah Hamdi, Faisal AlZahrani, Silvio Giancola, Bernard Ghanem
Multi-view projection techniques have proven highly effective at achieving top-performing results in 3D shape recognition. These methods learn how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.
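The following minimal PyTorch sketch illustrates the idea of regressing view-points from the shape itself; the architecture, pooling, and offset bound are assumptions for illustration and do not reproduce MVTN or its differentiable renderer. In the full pipeline, views rendered from the predicted view-points would be fed to a multi-view classifier and the whole system trained end-to-end.

```python
# Minimal sketch of shape-conditioned view-point regression (not the actual MVTN).
import torch
import torch.nn as nn

class ViewpointRegressor(nn.Module):
    def __init__(self, n_views=8, feat_dim=256, max_offset_deg=90.0):
        super().__init__()
        self.max_offset = max_offset_deg
        self.mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, n_views * 2)   # azimuth/elevation offset per view

    def forward(self, points):                         # points: (B, N, 3) sampled from the shape
        feat = self.mlp(points).max(dim=1).values      # permutation-invariant pooling
        offsets = torch.tanh(self.head(feat))          # bounded view-point offsets in [-1, 1]
        return offsets.view(points.size(0), -1, 2) * self.max_offset
```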
{"title":"MVTN: Learning Multi-view Transformations for 3D Understanding","authors":"Abdullah Hamdi, Faisal AlZahrani, Silvio Giancola, Bernard Ghanem","doi":"10.1007/s11263-024-02283-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02283-5","url":null,"abstract":"<p>Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"38 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification
Pub Date: 2024-11-09 | DOI: 10.1007/s11263-024-02276-4
Yukang Zhang, Yan Yan, Yang Lu, Hanzi Wang
Visible-infrared person re-identification (VIReID) has attracted increasing attention due to the requirements of 24-hour intelligent surveillance systems. In this task, one of the major challenges is the modality discrepancy between visible (VIS) and infrared (NIR) images. Most conventional methods try to design complex networks or generative models to mitigate the cross-modality discrepancy, ignoring the fact that the modality gap differs between different VIS and NIR image pairs. Different from existing methods, in this paper we propose an Adaptive Middle-modality Alignment Learning (AMML) method, which can effectively reduce the modality discrepancy via an adaptive middle-modality learning strategy at both the image level and the feature level. The proposed AMML method enjoys several merits. First, we propose an Adaptive Middle-modality Generator (AMG) module to reduce the modality discrepancy between VIS and NIR images at the image level, which effectively projects the VIS and NIR images into a unified middle-modality image (UMMI) space to adaptively generate middle-modality (M-modality) images. Second, we propose a feature-level Adaptive Distribution Alignment (ADA) loss to force the distributions of the VIS and NIR features to adaptively align with the distribution of the M-modality features. Moreover, we propose a novel Center-based Diverse Distribution Learning (CDDL) loss, which can effectively learn diverse cross-modality knowledge from the different modalities while reducing the modality discrepancy between the VIS and NIR modalities. Extensive experiments on three challenging VIReID datasets show the superiority of the proposed AMML method over other state-of-the-art methods. Most remarkably, our method achieves 77.8% Rank-1 and 74.8% mAP on the SYSU-MM01 dataset in the all-search mode, and 86.6% Rank-1 and 88.3% mAP in the indoor-search mode. The code is released at https://github.com/ZYK100/MMN.
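A rough sketch of the middle-modality idea follows: mix VIS and NIR images with a learnable ratio and pull both modalities' features toward the mixed ones via a simple moment-matching loss. The mixer and loss below are illustrative assumptions, not the paper's AMG module or its ADA/CDDL losses.

```python
# Illustrative middle-modality mixing and a simple distribution-alignment loss
# (first/second moments). Not the paper's AMG/ADA/CDDL formulations.
import torch
import torch.nn as nn

class MiddleModalityMixer(nn.Module):
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))      # learnable, adaptive mixing ratio

    def forward(self, vis_img, nir_img):
        a = torch.sigmoid(self.logit)
        return a * vis_img + (1 - a) * nir_img         # middle-modality image

def moment_alignment_loss(f_vis, f_nir, f_mid):
    """Align mean/variance of VIS and NIR features with the middle-modality features."""
    def gap(a, b):
        return (a.mean(0) - b.mean(0)).pow(2).mean() + (a.var(0) - b.var(0)).pow(2).mean()
    return gap(f_vis, f_mid) + gap(f_nir, f_mid)
```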
{"title":"Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification","authors":"Yukang Zhang, Yan Yan, Yang Lu, Hanzi Wang","doi":"10.1007/s11263-024-02276-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02276-4","url":null,"abstract":"<p>Visible-infrared person re-identification (VIReID) has attracted increasing attention due to the requirements for 24-hour intelligent surveillance systems. In this task, one of the major challenges is the modality discrepancy between the visible (VIS) and infrared (NIR) images. Most conventional methods try to design complex networks or generative models to mitigate the cross-modality discrepancy while ignoring the fact that the modality gaps differ between the different VIS and NIR images. Different from existing methods, in this paper, we propose an Adaptive Middle-modality Alignment Learning (AMML) method, which can effectively reduce the modality discrepancy via an adaptive middle modality learning strategy at both image level and feature level. The proposed AMML method enjoys several merits. First, we propose an Adaptive Middle-modality Generator (AMG) module to reduce the modality discrepancy between the VIS and NIR images from the image level, which can effectively project the VIS and NIR images into a unified middle modality image (UMMI) space to adaptively generate middle-modality (M-modality) images. Second, we propose a feature-level Adaptive Distribution Alignment (ADA) loss to force the distribution of the VIS features and NIR features adaptively align with the distribution of M-modality features. Moreover, we also propose a novel Center-based Diverse Distribution Learning (CDDL) loss, which can effectively learn diverse cross-modality knowledge from different modalities while reducing the modality discrepancy between the VIS and NIR modalities. Extensive experiments on three challenging VIReID datasets show the superiority of the proposed AMML method over the other state-of-the-art methods. More remarkably, our method achieves 77.8% in terms of Rank-1 and 74.8% in terms of mAP on the SYSU-MM01 dataset for all search mode, and 86.6% in terms of Rank-1 and 88.3% in terms of mAP on the SYSU-MM01 dataset for indoor search mode. The code is released at: https://github.com/ZYK100/MMN.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142597431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking Contemporary Deep Learning Techniques for Error Correction in Biometric Data
Pub Date: 2024-11-06 | DOI: 10.1007/s11263-024-02280-8
YenLung Lai, XingBo Dong, Zhe Jin, Wei Jia, Massimo Tistarelli, XueJun Li
In the realm of cryptography, the implementation of error correction in biometric data offers many benefits, including secure data storage and key derivation. Deep learning-based decoders have emerged as a catalyst for improved error correction when decoding noisy biometric data. Although these decoders exhibit competence in approximating precise solutions, we expose the potential inadequacy of their security assurances through a minimum entropy analysis. This limitation curtails their applicability in secure biometric contexts, as the inherent complexities of their non-linear neural network architectures pose challenges in modeling the solution distribution precisely. To address this limitation, we introduce U-Sketch, a universal approach for error correction in biometrics, which converts arbitrary input random biometric source distributions into independent and identically distributed (i.i.d.) data while maintaining the pairwise distance of the data post-transformation. This method ensures interpretability within the decoder, facilitating transparent entropy analysis and a substantiated security claim. Moreover, U-Sketch employs Maximum Likelihood Decoding, which provides optimal error tolerance and a precise security guarantee.
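As a concrete reference point for Maximum Likelihood Decoding, the sketch below decodes a binary repetition code over a memoryless binary symmetric channel (crossover probability below 0.5), where ML decoding reduces to picking the codeword at minimum Hamming distance; the tiny codebook is illustrative only and unrelated to the codes used in U-Sketch.

```python
# ML decoding over a binary symmetric channel = nearest codeword in Hamming distance.
import numpy as np

def ml_decode(received, codebook):
    """received: (n,) bits; codebook: (M, n) bits. Returns the ML codeword index."""
    hamming = (codebook != received).sum(axis=1)
    return int(np.argmin(hamming))

codebook = np.array([[0, 0, 0, 0, 0],
                     [1, 1, 1, 1, 1]])          # 5-bit repetition code
noisy = np.array([1, 0, 1, 1, 0])               # all-ones codeword with two bit flips
assert ml_decode(noisy, codebook) == 1          # still decodes to the all-ones codeword
```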
{"title":"Rethinking Contemporary Deep Learning Techniques for Error Correction in Biometric Data","authors":"YenLung Lai, XingBo Dong, Zhe Jin, Wei Jia, Massimo Tistarelli, XueJun Li","doi":"10.1007/s11263-024-02280-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02280-8","url":null,"abstract":"<p>In the realm of cryptography, the implementation of error correction in biometric data offers many benefits, including secure data storage and key derivation. Deep learning-based decoders have emerged as a catalyst for improved error correction when decoding noisy biometric data. Although these decoders exhibit competence in approximating precise solutions, we expose the potential inadequacy of their security assurances through a minimum entropy analysis. This limitation curtails their applicability in secure biometric contexts, as the inherent complexities of their non-linear neural network architectures pose challenges in modeling the solution distribution precisely. To address this limitation, we introduce U-Sketch, a universal approach for error correction in biometrics, which converts arbitrary input random biometric source distributions into independent and identically distributed (i.i.d.) data while maintaining the pairwise distance of the data post-transformation. This method ensures interpretability within the decoder, facilitating transparent entropy analysis and a substantiated security claim. Moreover, U-Sketch employs Maximum Likelihood Decoding, which provides optimal error tolerance and a precise security guarantee.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"48 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142588713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight
Pub Date: 2024-11-06 | DOI: 10.1007/s11263-024-02273-7
Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek
This paper strives to recognize activities in the dark as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards lower color contrast at test time. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes easy-to-obtain, unlabeled, task-irrelevant dark videos to improve an activity recognizer in low light. As the lower color contrast results in visual information loss, we further propose incorporating the complementary activity information within audio, which is invariant to illumination. Since the usefulness of audio and visual features differs depending on the amount of illumination, we introduce a 'darkness-adaptive' audio-visual recognizer. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate that our proposals are superior to image enhancement, domain adaptation, and alternative audio-visual fusion methods, and can even improve robustness to local darkness caused by occlusions. Project page: https://xiaobai1217.github.io/Day2Dark/.
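A toy sketch of illumination-dependent fusion follows: the darker the clip, the more weight the audio stream receives. The brightness proxy, clamping range, and feature shapes are assumptions for illustration, not the paper's darkness-adaptive recognizer.

```python
# Toy illumination-gated audio-visual fusion: weight the visual stream by a crude
# brightness estimate of the clip. Gating rule and shapes are assumptions.
import torch

def fuse(video_feat, audio_feat, frames):
    """frames: (B, T, 3, H, W) in [0, 1]; video_feat, audio_feat: (B, D)."""
    brightness = frames.mean(dim=(1, 2, 3, 4))          # crude illumination proxy, shape (B,)
    w_visual = brightness.clamp(0.2, 0.8).unsqueeze(1)  # keep both streams partially active
    return w_visual * video_feat + (1 - w_visual) * audio_feat
```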
{"title":"Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight","authors":"Yunhua Zhang, Hazel Doughty, Cees G. M. Snoek","doi":"10.1007/s11263-024-02273-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02273-7","url":null,"abstract":"<p>This paper strives to recognize activities in the dark, as well as in the day. We first establish that state-of-the-art activity recognizers are effective during the day, but not trustworthy in the dark. The main causes are the limited availability of labeled dark videos to learn from, as well as the distribution shift towards the lower color contrast at test-time. To compensate for the lack of labeled dark videos, we introduce a pseudo-supervised learning scheme, which utilizes easy to obtain unlabeled and task-irrelevant dark videos to improve an activity recognizer in low light. As the lower color contrast results in visual information loss, we further propose to incorporate the complementary activity information within audio, which is invariant to illumination. Since the usefulness of audio and visual features differs depending on the amount of illumination, we introduce our ‘darkness-adaptive’ audio-visual recognizer. Experiments on EPIC-Kitchens, Kinetics-Sound, and Charades demonstrate our proposals are superior to image enhancement, domain adaptation and alternative audio-visual fusion methods, and can even improve robustness to local darkness caused by occlusions. Project page: https://xiaobai1217.github.io/Day2Dark/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"68 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142588590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Achieving Procedure-Aware Instructional Video Correlation Learning Under Weak Supervision from a Collaborative Perspective
Pub Date: 2024-11-04 | DOI: 10.1007/s11263-024-02272-8
Tianyao He, Huabin Liu, Zelin Ni, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Weiyao Lin
Video Correlation Learning (VCL) denotes a high-level research domain centered on analyzing the semantic and temporal correspondences between videos through a comparative paradigm. Recently, instructional-video tasks have drawn increasing attention due to their promising potential. Compared with general videos, instructional videos contain more complex procedural information, making correlation learning quite challenging. To obtain procedural knowledge, current methods rely heavily on fine-grained step-level annotations, which are costly and non-scalable. To improve VCL on instructional videos, we introduce a weakly supervised framework named Collaborative Procedure Alignment (CPA). Specifically, our framework comprises two core components: the collaborative step mining (CSM) module and the frame-to-step alignment (FSA) module. Without requiring step-level annotations, the CSM module conducts temporal step segmentation and pseudo-step learning by exploring the inner procedure correspondences between paired videos. Subsequently, the FSA module efficiently yields the probability of aligning one video's frame-level features with another video's pseudo-step labels, which serves as a reliable correlation measure for paired videos. The two modules are inherently interconnected and mutually enhance each other to extract step-level knowledge and measure video correlation distances accurately. Our framework provides an effective tool for instructional video correlation learning. We instantiate it on four representative tasks: sequence verification, few-shot action recognition, temporal action segmentation, and action quality assessment. Furthermore, we extend the framework to additional functions to further exhibit its potential. Extensive and in-depth experiments validate CPA's strong correlation learning capability on instructional videos. The implementation can be found at https://github.com/hotelll/Collaborative_Procedure_Alignment.
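The sketch below illustrates frame-to-step alignment in its simplest form: soft-assign one video's frame features to the other video's pseudo-step prototypes and use the mean assignment confidence as a correlation score. The temperature and scoring rule are assumptions, not the exact FSA formulation.

```python
# Minimal frame-to-step soft alignment via cosine similarity; illustrative only.
import torch
import torch.nn.functional as F

def frame_to_step_alignment(frames, steps, tau=0.1):
    """frames: (T, D) frame features of video A; steps: (K, D) pseudo-step prototypes of video B."""
    sim = F.normalize(frames, dim=-1) @ F.normalize(steps, dim=-1).t()  # (T, K) cosine similarity
    probs = F.softmax(sim / tau, dim=-1)           # per-frame step assignment probabilities
    score = probs.max(dim=-1).values.mean()        # high when frames align confidently to steps
    return probs, score
```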
{"title":"Achieving Procedure-Aware Instructional Video Correlation Learning Under Weak Supervision from a Collaborative Perspective","authors":"Tianyao He, Huabin Liu, Zelin Ni, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Weiyao Lin","doi":"10.1007/s11263-024-02272-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02272-8","url":null,"abstract":"<p>Video Correlation Learning (VCL) delineates a high-level research domain that centers on analyzing the semantic and temporal correspondences between videos through a comparative paradigm. Recently, instructional video-related tasks have drawn increasing attention due to their promising potential. Compared with general videos, instructional videos possess more complex procedure information, making correlation learning quite challenging. To obtain procedural knowledge, current methods rely heavily on fine-grained step-level annotations, which are costly and non-scalable. To improve VCL on instructional videos, we introduce a weakly supervised framework named Collaborative Procedure Alignment (CPA). To be specific, our framework comprises two core components: the collaborative step mining (CSM) module and the frame-to-step alignment (FSA) module. Free of the necessity for step-level annotations, the CSM module can properly conduct temporal step segmentation and pseudo-step learning by exploring the inner procedure correspondences between paired videos. Subsequently, the FSA module efficiently yields the probability of aligning one video’s frame-level features with another video’s pseudo-step labels, which can act as a reliable correlation degree for paired videos. The two modules are inherently interconnected and can mutually enhance each other to extract the step-level knowledge and measure the video correlation distances accurately. Our framework provides an effective tool for instructional video correlation learning. We instantiate our framework on four representative tasks, including sequence verification, few-shot action recognition, temporal action segmentation, and action quality assessment. Furthermore, we extend our framework to more innovative functions to further exhibit its potential. Extensive and in-depth experiments validate CPA’s strong correlation learning capability on instructional videos. The implementation can be found at https://github.com/hotelll/Collaborative_Procedure_Alignment.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"109 4 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142580525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EfficientDeRain+: Learning Uncertainty-Aware Filtering via RainMix Augmentation for High-Efficiency Deraining
Pub Date: 2024-11-04 | DOI: 10.1007/s11263-024-02281-7
Qing Guo, Hua Qi, Jingyang Sun, Felix Juefei-Xu, Lei Ma, Di Lin, Wei Feng, Song Wang
Deraining is a significant and fundamental computer vision task, aiming to remove rain streaks and accumulations from an image or video. Existing deraining methods usually make heuristic assumptions about the rain model, which compels them to employ complex optimization or iterative refinement for high recovery quality. However, this leads to time-consuming methods and limits their effectiveness on rain patterns that deviate from those assumptions. This paper proposes a simple yet efficient deraining method by formulating deraining as a predictive filtering problem without complex rain model assumptions. Specifically, we identify spatially-variant predictive filtering (SPFilt), which adaptively predicts proper kernels via a deep network to filter different individual pixels. Since the filtering can be implemented via well-accelerated convolution, our method can be significantly efficient. We further propose EfDeRain+, which contains three main contributions to address residual rain traces, multi-scale rain streaks, and diverse rain patterns without harming efficiency. First, we propose uncertainty-aware cascaded predictive filtering (UC-PFilt), which can identify the difficulties of reconstructing clean pixels via predicted kernels and remove residual rain traces effectively. Second, we design weight-sharing multi-scale dilated filtering (WS-MS-DFilt) to handle multi-scale rain streaks without harming efficiency. Third, to eliminate the gap across diverse rain patterns, we propose a novel data augmentation method (i.e., RainMix) to train our deep models. By combining all contributions with sophisticated analysis of different variants, our final method outperforms baseline methods on six single-image deraining datasets and one video-deraining dataset in terms of both recovery quality and speed. In particular, EfDeRain+ can derain a 481 × 321 image in about 6.3 ms and is over 74 times faster than the top baseline method with even better recovery quality. We release code at https://github.com/tsingqguo/efficientderainplus.
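The general kernel-prediction mechanism behind spatially-variant predictive filtering can be sketched as follows: a network predicts a k×k kernel per pixel, and the rainy image is filtered with those kernels via unfold. This illustrates the mechanism only and is not EfDeRain+ itself.

```python
# Sketch of applying per-pixel predicted kernels (kernel-prediction filtering).
# The kernels are assumed to come from some network and to be softmax-normalized.
import torch
import torch.nn.functional as F

def apply_per_pixel_kernels(image, kernels, k=3):
    """image: (B, C, H, W) rainy input; kernels: (B, k*k, H, W) per-pixel filter weights."""
    B, C, H, W = image.shape
    patches = F.unfold(image, k, padding=k // 2)           # (B, C*k*k, H*W) local neighborhoods
    patches = patches.view(B, C, k * k, H, W)
    out = (patches * kernels.unsqueeze(1)).sum(dim=2)      # weighted sum per pixel
    return out                                             # (B, C, H, W) filtered (derained) estimate
```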
{"title":"EfficientDeRain+: Learning Uncertainty-Aware Filtering via RainMix Augmentation for High-Efficiency Deraining","authors":"Qing Guo, Hua Qi, Jingyang Sun, Felix Juefei-Xu, Lei Ma, Di Lin, Wei Feng, Song Wang","doi":"10.1007/s11263-024-02281-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02281-7","url":null,"abstract":"<p>Deraining is a significant and fundamental computer vision task, aiming to remove the rain streaks and accumulations in an image or video. Existing deraining methods usually make heuristic assumptions of the rain model, which compels them to employ complex optimization or iterative refinement for high recovery quality. However, this leads to time-consuming methods and affects the effectiveness of addressing rain patterns, deviating from the assumptions. This paper proposes a simple yet efficient deraining method by formulating deraining as a predictive filtering problem without complex rain model assumptions. Specifically, we identify spatially-variant predictive filtering (SPFilt) that adaptively predicts proper kernels via a deep network to filter different individual pixels. Since the filtering can be implemented via well-accelerated convolution, our method can be significantly efficient. We further propose the <i>EfDeRain+</i> that contains three main contributions to address residual rain traces, multi-scale, and diverse rain patterns without harming efficiency. <i>First</i>, we propose the uncertainty-aware cascaded predictive filtering (UC-PFilt) that can identify the difficulties of reconstructing clean pixels via predicted kernels and remove the residual rain traces effectively. <i>Second</i>, we design the weight-sharing multi-scale dilated filtering (WS-MS-DFilt) to handle multi-scale rain streaks without harming the efficiency. <i>Third</i>, to eliminate the gap across diverse rain patterns, we propose a novel data augmentation method (<i>i.e</i>., <i>RainMix</i>) to train our deep models. By combining all contributions with sophisticated analysis on different variants, our final method outperforms baseline methods on six single-image deraining datasets and one video-deraining dataset in terms of both recovery quality and speed. In particular, <i>EfDeRain+</i> can derain within about 6.3 ms on a <span>(481times 321)</span> image and is over 74 times faster than the top baseline method with even better recovery quality. We release code in https://github.com/tsingqguo/efficientderainplus.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"68 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142580522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few Annotated Pixels and Point Cloud Based Weakly Supervised Semantic Segmentation of Driving Scenes
Pub Date: 2024-11-04 | DOI: 10.1007/s11263-024-02275-5
Huimin Ma, Sheng Yi, Shijie Chen, Jiansheng Chen, Yu Wang
Previous weakly supervised semantic segmentation (WSSS) methods mainly begin with segmentation seeds from the CAM method. Because of the high complexity of driving-scene images, such frameworks do not perform well on driving-scene datasets. In this paper, we propose a new kind of WSSS annotation for complex driving-scene datasets, with only one or several labeled points per category. This annotation is more lightweight than image-level annotation and provides critical localization information for prototypes. We propose a framework to address the WSSS task under this annotation, which generates prototype feature vectors from the labeled points and then produces 2D pseudo labels. Moreover, we find that point cloud data are useful for distinguishing different objects. Our framework extracts rich semantic information from unlabeled point cloud data and generates instance masks, without requiring extra annotation resources. We combine the pseudo labels and the instance masks to correct erroneous regions and thus obtain more accurate supervision for training the semantic segmentation network. We evaluated this framework on the KITTI dataset. Experiments show that the proposed method achieves state-of-the-art performance.
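A minimal sketch of prototype-based pseudo-labelling from point annotations: average the features at the labeled points of each class into prototypes, then assign every pixel to its most similar prototype with a confidence threshold. The cosine similarity and threshold are assumptions for illustration, not the exact framework.

```python
# Illustrative pseudo-label generation from a few labeled points per class.
import torch
import torch.nn.functional as F

def pseudo_labels(feat, points, threshold=0.7):
    """feat: (D, H, W) pixel features; points: dict {class_id: [(y, x), ...]} labeled points."""
    protos, classes = [], []
    for cls, coords in points.items():
        vecs = torch.stack([feat[:, y, x] for y, x in coords])
        protos.append(F.normalize(vecs.mean(0), dim=0))    # one prototype per class
        classes.append(cls)
    protos = torch.stack(protos)                           # (C, D)
    sim = torch.einsum('cd,dhw->chw', protos, F.normalize(feat, dim=0))
    conf, idx = sim.max(dim=0)
    labels = torch.tensor(classes)[idx]                    # (H, W) pseudo labels
    labels[conf < threshold] = 255                         # ignore low-confidence pixels
    return labels
```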
{"title":"Few Annotated Pixels and Point Cloud Based Weakly Supervised Semantic Segmentation of Driving Scenes","authors":"Huimin Ma, Sheng Yi, Shijie Chen, Jiansheng Chen, Yu Wang","doi":"10.1007/s11263-024-02275-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02275-5","url":null,"abstract":"<p>Previous weakly supervised semantic segmentation (WSSS) methods mainly begin with the segmentation seeds from the CAM method. Because of the high complexity of driving scene images, their framework performs not well on driving scene datasets. In this paper, we propose a new kind of WSSS annotations on the complex driving scene dataset, with only one or several labeled points per category. This annotation is more lightweight than image-level annotation and provides critical localization information for prototypes. We propose a framework to address the WSSS task under this annotation, which generates prototype feature vectors from labeled points and then produces 2D pseudo labels. Besides, we found the point cloud data is useful for distinguishing different objects. Our framework could extract rich semantic information from unlabeled point cloud data and generate instance masks, which does not require extra annotation resources. We combine the pseudo labels and the instance masks to modify the incorrect regions and thus obtain more accurate supervision for training the semantic segmentation network. We evaluated this framework on the KITTI dataset. Experiments show that the proposed method achieves state-of-the-art performance.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2022 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142580565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
APPTracker+: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking
Pub Date: 2024-11-03 | DOI: 10.1007/s11263-024-02237-x
Tao Zhou, Qi Ye, Wenhan Luo, Haizhou Ran, Zhiguo Shi, Jiming Chen
Multi-object tracking (MOT) in low-frame-rate videos is a promising solution to better meet the computing, storage, and transmission bandwidth constraints of edge devices. Tracking at a low frame rate poses particular challenges in the association stage, as objects in two successive frames typically exhibit much quicker variations in locations, velocities, appearances, and visibilities than at normal frame rates. In this paper, we observe severe performance degradation of many existing association strategies caused by such variations. Though optical-flow-based methods like CenterTrack can handle large displacements to some extent thanks to their large receptive field, their temporally local nature makes them fail to give reliable displacement estimations for objects that newly appear in the current frame (i.e., are not visible in the previous frame). To overcome this local nature, we propose an online tracking method that extends the CenterTrack architecture with a new head, named APP, to recognize unreliable displacement estimations. Further, to capture the fine-grained, per-estimation unreliability, we extend the binary APP predictions to displacement uncertainties. To this end, we reformulate the displacement estimation task via Bayesian deep learning tools. With APP predictions, we conduct association in a multi-stage manner, where visual cues or historical motion cues are leveraged in the corresponding stage. By rethinking the commonly used bipartite matching algorithms, we equip the proposed multi-stage association policy with a hybrid matching strategy conditioned on displacement uncertainties. Our method shows robustness in preserving identities in low-frame-rate video sequences. Experimental results on public datasets in various low-frame-rate settings demonstrate the advantages of the proposed method.
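One association stage can be sketched as bipartite matching on a cost matrix (e.g., center distances after applying predicted displacements), with a gating threshold deciding which assignments count as matches. The threshold and single-stage simplification below are assumptions and do not reproduce the full uncertainty-conditioned multi-stage policy.

```python
# Sketch of a single association stage via Hungarian bipartite matching with gating.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost, max_cost=0.7):
    """cost: (num_tracks, num_detections) matrix; smaller means a more likely match."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    unmatched_tracks = set(range(cost.shape[0])) - {r for r, _ in matches}
    unmatched_dets = set(range(cost.shape[1])) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_dets
```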
{"title":"APPTracker+: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking","authors":"Tao Zhou, Qi Ye, Wenhan Luo, Haizhou Ran, Zhiguo Shi, Jiming Chen","doi":"10.1007/s11263-024-02237-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02237-x","url":null,"abstract":"<p>Multi-object tracking (MOT) in the scenario of low-frame-rate videos is a promising solution to better meet the computing, storage, and transmitting bandwidth resource constraints of edge devices. Tracking with a low frame rate poses particular challenges in the association stage as objects in two successive frames typically exhibit much quicker variations in locations, velocities, appearances, and visibilities than those in normal frame rates. In this paper, we observe severe performance degeneration of many existing association strategies caused by such variations. Though optical-flow-based methods like CenterTrack can handle the large displacement to some extent due to their large receptive field, the temporally local nature makes them fail to give reliable displacement estimations of objects that newly appear in the current frame (i.e., not visible in the previous frame). To overcome the local nature of optical-flow-based methods, we propose an online tracking method by extending the CenterTrack architecture with a new head, named APP, to recognize unreliable displacement estimations. Further, to capture the fine-grained and private unreliability of each displacement estimation, we extend the binary APP predictions to displacement uncertainties. To this end, we reformulate the displacement estimation task via Bayesian deep learning tools. With APP predictions, we propose to conduct association in a multi-stage manner where vision cues or historical motion cues are leveraged in the corresponding stage. By rethinking the commonly used bipartite matching algorithms, we equip the proposed multi-stage association policy with a hybrid matching strategy conditioned on displacement uncertainties. Our method shows robustness in preserving identities in low-frame-rate video sequences. Experimental results on public datasets in various low-frame-rate settings demonstrate the advantages of the proposed method.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"7 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142566097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}