A Comprehensive Survey on Deep Multimodal Learning with Missing Modality
Renjie Wu, Hu Wang, Hsiang-Ting Chen
During multimodal model training and inference, data samples may be missing certain modalities, owing to sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors, which compromises model performance. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey to cover the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, and concluding with a discussion of challenges and potential future directions in the field.
{"title":"A Comprehensive Survey on Deep Multimodal Learning with Missing Modality","authors":"Renjie Wu, Hu Wang, Hsiang-Ting Chen","doi":"arxiv-2409.07825","DOIUrl":"https://doi.org/arxiv-2409.07825","url":null,"abstract":"During multimodal model training and reasoning, data samples may miss certain\u0000modalities and lead to compromised model performance due to sensor limitations,\u0000cost constraints, privacy concerns, data loss, and temporal and spatial\u0000factors. This survey provides an overview of recent progress in Multimodal\u0000Learning with Missing Modality (MLMM), focusing on deep learning techniques. It\u0000is the first comprehensive survey that covers the historical background and the\u0000distinction between MLMM and standard multimodal learning setups, followed by a\u0000detailed analysis of current MLMM methods, applications, and datasets,\u0000concluding with a discussion about challenges and potential future directions\u0000in the field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task-Augmented Cross-View Imputation Network for Partial Multi-View Incomplete Multi-Label Classification
Xiaohuan Lu, Lian Zhao, Wai Keung Wong, Jie Wen, Jiang Long, Wulin Xie
In real-world scenarios, multi-view multi-label learning often encounters the challenge of incomplete training data due to limitations in data collection and unreliable annotation processes. The absence of multi-view features impairs the comprehensive understanding of samples, omitting details that are crucial for classification. To address this issue, we present a task-augmented cross-view imputation network (TACVI-Net) for handling partial multi-view incomplete multi-label classification. Specifically, we employ a two-stage network to derive highly task-relevant features with which to recover the missing views. In the first stage, we leverage the information bottleneck theory to obtain a discriminative representation of each view by extracting task-relevant information through a view-specific encoder-classifier architecture. In the second stage, an autoencoder-based multi-view reconstruction network is used to extract a high-level semantic representation of the augmented features and recover the missing data, thereby aiding the final classification task. Extensive experiments on five datasets demonstrate that TACVI-Net outperforms other state-of-the-art methods.
{"title":"Task-Augmented Cross-View Imputation Network for Partial Multi-View Incomplete Multi-Label Classification","authors":"Xiaohuan Lu, Lian Zhao, Wai Keung Wong, Jie Wen, Jiang Long, Wulin Xie","doi":"arxiv-2409.07931","DOIUrl":"https://doi.org/arxiv-2409.07931","url":null,"abstract":"In real-world scenarios, multi-view multi-label learning often encounters the\u0000challenge of incomplete training data due to limitations in data collection and\u0000unreliable annotation processes. The absence of multi-view features impairs the\u0000comprehensive understanding of samples, omitting crucial details essential for\u0000classification. To address this issue, we present a task-augmented cross-view\u0000imputation network (TACVI-Net) for the purpose of handling partial multi-view\u0000incomplete multi-label classification. Specifically, we employ a two-stage\u0000network to derive highly task-relevant features to recover the missing views.\u0000In the first stage, we leverage the information bottleneck theory to obtain a\u0000discriminative representation of each view by extracting task-relevant\u0000information through a view-specific encoder-classifier architecture. In the\u0000second stage, an autoencoder based multi-view reconstruction network is\u0000utilized to extract high-level semantic representation of the augmented\u0000features and recover the missing data, thereby aiding the final classification\u0000task. Extensive experiments on five datasets demonstrate that our TACVI-Net\u0000outperforms other state-of-the-art methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations
Martin Thißen, Thi Ngoc Diep Tran, Ben Joel Schönbein, Ute Trapp, Barbara Esteve Ratsch, Beate Egner, Romana Piat, Elke Hergenröther
The examination of the musculoskeletal system in dogs is a challenging task in veterinary practice. In this work, a novel method has been developed that enables efficient documentation of a dog's condition through a visual representation. However, since this visual documentation is new, no training data exist for it. The objective of this work is therefore to mitigate the impact of data scarcity in order to develop an AI-based diagnostic support system. To this end, we investigate the potential of synthetic data that mimics realistic visual documentations of diseases for pre-training AI models, and we propose a method for generating such synthetic image data. Initially, a basic dataset containing three distinct classes is generated, followed by the creation of a more sophisticated dataset containing 36 different classes. Both datasets are used for the pre-training of an AI model. Subsequently, an evaluation dataset is created, consisting of 250 manually created visual documentations for five different diseases, along with a subset containing 25 examples. On the 25-example subset, the results demonstrate an improvement of approximately 10% in diagnostic accuracy when using generated synthetic images that mimic real-world visual documentations. However, these results do not hold for the larger evaluation dataset of 250 examples, indicating that the advantages of using synthetic data for pre-training an AI model emerge primarily when only a few examples of visual documentations are available for a given disease. Overall, this work provides valuable insights into mitigating the limitations imposed by limited training data through the strategic use of generated synthetic data, presenting an approach applicable beyond the canine musculoskeletal assessment domain.
{"title":"Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations","authors":"Martin Thißen, Thi Ngoc Diep Tran, Ben Joel Schönbein, Ute Trapp, Barbara Esteve Ratsch, Beate Egner, Romana Piat, Elke Hergenröther","doi":"arxiv-2409.08181","DOIUrl":"https://doi.org/arxiv-2409.08181","url":null,"abstract":"The examination of the musculoskeletal system in dogs is a challenging task\u0000in veterinary practice. In this work, a novel method has been developed that\u0000enables efficient documentation of a dog's condition through a visual\u0000representation. However, since the visual documentation is new, there is no\u0000existing training data. The objective of this work is therefore to mitigate the\u0000impact of data scarcity in order to develop an AI-based diagnostic support\u0000system. To this end, the potential of synthetic data that mimics realistic\u0000visual documentations of diseases for pre-training AI models is investigated.\u0000We propose a method for generating synthetic image data that mimics realistic\u0000visual documentations. Initially, a basic dataset containing three distinct\u0000classes is generated, followed by the creation of a more sophisticated dataset\u0000containing 36 different classes. Both datasets are used for the pre-training of\u0000an AI model. Subsequently, an evaluation dataset is created, consisting of 250\u0000manually created visual documentations for five different diseases. This\u0000dataset, along with a subset containing 25 examples. The obtained results on\u0000the evaluation dataset containing 25 examples demonstrate a significant\u0000enhancement of approximately 10% in diagnosis accuracy when utilizing generated\u0000synthetic images that mimic real-world visual documentations. However, these\u0000results do not hold true for the larger evaluation dataset containing 250\u0000examples, indicating that the advantages of using synthetic data for\u0000pre-training an AI model emerge primarily when dealing with few examples of\u0000visual documentations for a given disease. Overall, this work provides valuable\u0000insights into mitigating the limitations imposed by limited training data\u0000through the strategic use of generated synthetic data, presenting an approach\u0000applicable beyond the canine musculoskeletal assessment domain.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structured Pruning for Efficient Visual Place Recognition
Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan
Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Given that VPR methods need to operate in real-time on embedded systems, it is critical to optimize them for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method that not only streamlines common VPR architectures but also strategically removes redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach reduces memory usage and latency by 21% and 16%, respectively, across models, while lowering recall@1 accuracy by less than 1%. These improvements enable real-time applications on edge devices with negligible accuracy loss.
{"title":"Structured Pruning for Efficient Visual Place Recognition","authors":"Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan","doi":"arxiv-2409.07834","DOIUrl":"https://doi.org/arxiv-2409.07834","url":null,"abstract":"Visual Place Recognition (VPR) is fundamental for the global re-localization\u0000of robots and devices, enabling them to recognize previously visited locations\u0000based on visual inputs. This capability is crucial for maintaining accurate\u0000mapping and localization over large areas. Given that VPR methods need to\u0000operate in real-time on embedded systems, it is critical to optimize these\u0000systems for minimal resource consumption. While the most efficient VPR\u0000approaches employ standard convolutional backbones with fixed descriptor\u0000dimensions, these often lead to redundancy in the embedding space as well as in\u0000the network architecture. Our work introduces a novel structured pruning\u0000method, to not only streamline common VPR architectures but also to\u0000strategically remove redundancies within the feature embedding space. This dual\u0000focus significantly enhances the efficiency of the system, reducing both map\u0000and model memory requirements and decreasing feature extraction and retrieval\u0000latencies. Our approach has reduced memory usage and latency by 21% and 16%,\u0000respectively, across models, while minimally impacting recall@1 accuracy by\u0000less than 1%. This significant improvement enhances real-time applications on\u0000edge devices with negligible accuracy loss.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SDformer: Efficient End-to-End Transformer for Depth Completion
Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang
Depth completion aims to predict dense depth maps from the sparse depth measurements of a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their excellent performance, they suffer from a limited receptive field. To overcome this drawback of CNNs, a more effective and powerful alternative has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. The standard Transformer, however, incurs a computational cost that grows quadratically with input resolution due to the key-query dot-product, which makes it ill-suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module for extracting and concatenating depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of computing self-attention over the whole feature maps, we apply different window sizes to extract long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
{"title":"SDformer: Efficient End-to-End Transformer for Depth Completion","authors":"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang","doi":"arxiv-2409.08159","DOIUrl":"https://doi.org/arxiv-2409.08159","url":null,"abstract":"Depth completion aims to predict dense depth maps with sparse depth\u0000measurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\u0000based models are the most popular methods applied to depth completion tasks.\u0000However, despite the excellent high-end performance, they suffer from a limited\u0000representation area. To overcome the drawbacks of CNNs, a more effective and\u0000powerful method has been presented: the Transformer, which is an adaptive\u0000self-attention setting sequence-to-sequence model. While the standard\u0000Transformer quadratically increases the computational cost from the key-query\u0000dot-product of input resolution which improperly employs depth completion\u0000tasks. In this work, we propose a different window-based Transformer\u0000architecture for depth completion tasks named Sparse-to-Dense Transformer\u0000(SDformer). The network consists of an input module for the depth map and RGB\u0000image features extraction and concatenation, a U-shaped encoder-decoder\u0000Transformer for extracting deep features, and a refinement module.\u0000Specifically, we first concatenate the depth map features with the RGB image\u0000features through the input model. Then, instead of calculating self-attention\u0000with the whole feature maps, we apply different window sizes to extract the\u0000long-range depth dependencies. Finally, we refine the predicted features from\u0000the input module and the U-shaped encoder-decoder Transformer module to get the\u0000enriching depth features and employ a convolution layer to obtain the dense\u0000depth map. In practice, the SDformer obtains state-of-the-art results against\u0000the CNN-based depth completion models with lower computing loads and parameters\u0000on the NYU Depth V2 and KITTI DC datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LT3SD: Latent Trees for 3D Scene Diffusion
Quan Meng, Lei Li, Matthias Nießner, Angela Dai
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but they are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation that effectively encodes both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes of varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion of partial scene observations.
{"title":"LT3SD: Latent Trees for 3D Scene Diffusion","authors":"Quan Meng, Lei Li, Matthias Nießner, Angela Dai","doi":"arxiv-2409.08215","DOIUrl":"https://doi.org/arxiv-2409.08215","url":null,"abstract":"We present LT3SD, a novel latent diffusion model for large-scale 3D scene\u0000generation. Recent advances in diffusion models have shown impressive results\u0000in 3D object generation, but are limited in spatial extent and quality when\u0000extended to 3D scenes. To generate complex and diverse 3D scene structures, we\u0000introduce a latent tree representation to effectively encode both\u0000lower-frequency geometry and higher-frequency detail in a coarse-to-fine\u0000hierarchy. We can then learn a generative diffusion process in this latent 3D\u0000scene space, modeling the latent components of a scene at each resolution\u0000level. To synthesize large-scale scenes with varying sizes, we train our\u0000diffusion model on scene patches and synthesize arbitrary-sized output 3D\u0000scenes through shared diffusion generation across multiple scene patches.\u0000Through extensive experiments, we demonstrate the efficacy and benefits of\u0000LT3SD for large-scale, high-quality unconditional 3D scene generation and for\u0000probabilistic completion for partial scene observations.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UNIT: Unsupervised Online Instance Segmentation through Time
Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
Online object segmentation and tracking in Lidar point clouds enables autonomous agents to understand their surroundings and make safe decisions. Unfortunately, manual annotations for these tasks are prohibitively costly. We tackle this problem with the task of class-agnostic unsupervised online instance segmentation and tracking. To that end, we leverage an instance segmentation backbone and propose a new training recipe that enables the online tracking of objects. Our network is trained on pseudo-labels, eliminating the need for manual annotations. We conduct an evaluation using metrics adapted for temporal instance segmentation. Computing these metrics requires temporally-consistent instance labels. When unavailable, we construct these labels using the available 3D bounding boxes and semantic labels in the dataset. We compare our method against strong baselines and demonstrate its superiority across two different outdoor Lidar datasets.
{"title":"UNIT: Unsupervised Online Instance Segmentation through Time","authors":"Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit","doi":"arxiv-2409.07887","DOIUrl":"https://doi.org/arxiv-2409.07887","url":null,"abstract":"Online object segmentation and tracking in Lidar point clouds enables\u0000autonomous agents to understand their surroundings and make safe decisions.\u0000Unfortunately, manual annotations for these tasks are prohibitively costly. We\u0000tackle this problem with the task of class-agnostic unsupervised online\u0000instance segmentation and tracking. To that end, we leverage an instance\u0000segmentation backbone and propose a new training recipe that enables the online\u0000tracking of objects. Our network is trained on pseudo-labels, eliminating the\u0000need for manual annotations. We conduct an evaluation using metrics adapted for\u0000temporal instance segmentation. Computing these metrics requires\u0000temporally-consistent instance labels. When unavailable, we construct these\u0000labels using the available 3D bounding boxes and semantic labels in the\u0000dataset. We compare our method against strong baselines and demonstrate its\u0000superiority across two different outdoor Lidar datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes
Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai
RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies pay insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly affects the attention representation, leading to prediction errors caused by attention-shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. First, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Second, the similarity in the RGB-D feature space is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP decoder is used to effectively fuse multi-scale features and meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
{"title":"Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes","authors":"Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai","doi":"arxiv-2409.07995","DOIUrl":"https://doi.org/arxiv-2409.07995","url":null,"abstract":"RGB-D has gradually become a crucial data source for understanding complex\u0000scenes in assisted driving. However, existing studies have paid insufficient\u0000attention to the intrinsic spatial properties of depth maps. This oversight\u0000significantly impacts the attention representation, leading to prediction\u0000errors caused by attention shift issues. To this end, we propose a novel\u0000learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the\u0000effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization\u0000(Depth SAO) as offset to represent real-world spatial relationships. Secondly,\u0000the similarity in the feature space of RGB-D is learned by Depth Linear\u0000Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level.\u0000Finally, an MLP Decoder is utilized to effectively fuse multi-scale features\u0000for meeting real-time requirements. Comprehensive experiments demonstrate that\u0000the proposed DiPFormer significantly addresses the issue of attention\u0000misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% /\u0000+1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI\u0000(97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes\u0000(83.4% mIoU) datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"433 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Cost Tree Crown Dieback Estimation Using Deep Learning-Based Segmentation
M. J. Allen, D. Moreno-Fernández, P. Ruiz-Benito, S. W. D. Grieve, E. R. Lines
The global increase in observed forest dieback, characterised by the death of tree foliage, heralds widespread decline in forest ecosystems. This degradation causes significant changes to ecosystem services and functions, including habitat provision and carbon sequestration, which can be difficult to detect using traditional monitoring techniques, highlighting the need for large-scale and high-frequency monitoring. Contemporary developments in the instruments and methods used to gather and process data at large scales mean this monitoring is now possible. In particular, the advancement of low-cost drone technology and deep learning on consumer-level hardware provide new opportunities. Here, we use an approach based on deep learning and vegetation indices to assess crown dieback from RGB aerial data without the need for expensive instrumentation such as LiDAR. We use an iterative approach to match crown footprints predicted by deep learning with field-based inventory data from a Mediterranean ecosystem exhibiting drought-induced dieback, and compare expert field-based crown dieback estimation with vegetation-index-based estimates. We obtain high overall segmentation accuracy (mAP: 0.519) without additional technical development of the underlying Mask R-CNN model, underscoring the potential of these approaches for non-expert use and proving their applicability to real-world conservation. We also find that colour-coordinate-based estimates of dieback correlate well with expert field-based estimation. Substituting Mask R-CNN predictions for ground-truth crown footprints had a negligible impact on dieback estimates, indicating robustness. Our findings demonstrate the potential of automated data collection and processing, including the application of deep learning, to improve the coverage, speed and cost of forest dieback monitoring.
{"title":"Low-Cost Tree Crown Dieback Estimation Using Deep Learning-Based Segmentation","authors":"M. J. Allen, D. Moreno-Fernández, P. Ruiz-Benito, S. W. D. Grieve, E. R. Lines","doi":"arxiv-2409.08171","DOIUrl":"https://doi.org/arxiv-2409.08171","url":null,"abstract":"The global increase in observed forest dieback, characterised by the death of\u0000tree foliage, heralds widespread decline in forest ecosystems. This degradation\u0000causes significant changes to ecosystem services and functions, including\u0000habitat provision and carbon sequestration, which can be difficult to detect\u0000using traditional monitoring techniques, highlighting the need for large-scale\u0000and high-frequency monitoring. Contemporary developments in the instruments and\u0000methods to gather and process data at large-scales mean this monitoring is now\u0000possible. In particular, the advancement of low-cost drone technology and deep\u0000learning on consumer-level hardware provide new opportunities. Here, we use an\u0000approach based on deep learning and vegetation indices to assess crown dieback\u0000from RGB aerial data without the need for expensive instrumentation such as\u0000LiDAR. We use an iterative approach to match crown footprints predicted by deep\u0000learning with field-based inventory data from a Mediterranean ecosystem\u0000exhibiting drought-induced dieback, and compare expert field-based crown\u0000dieback estimation with vegetation index-based estimates. We obtain high\u0000overall segmentation accuracy (mAP: 0.519) without the need for additional\u0000technical development of the underlying Mask R-CNN model, underscoring the\u0000potential of these approaches for non-expert use and proving their\u0000applicability to real-world conservation. We also find colour-coordinate based\u0000estimates of dieback correlate well with expert field-based estimation.\u0000Substituting ground truth for Mask R-CNN model predictions showed negligible\u0000impact on dieback estimates, indicating robustness. Our findings demonstrate\u0000the potential of automated data collection and processing, including the\u0000application of deep learning, to improve the coverage, speed and cost of forest\u0000dieback monitoring.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"112 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance
Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu
Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance. Despite advancements in diffusion-model-based methods, existing approaches still struggle to balance identity preservation with text-prompt alignment. In this study, we conduct an in-depth investigation of this issue and uncover key insights for achieving effective identity preservation while maintaining this balance. Our key findings are: (1) the design of the subject image encoder significantly impacts identity-preservation quality, and (2) generating an initial layout is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: a carefully crafted subject image encoder, based on the UNet architecture of the pretrained Stable Diffusion model, to ensure high-quality identity transfer, and a generation process that decouples the guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data.
{"title":"EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance","authors":"Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu","doi":"arxiv-2409.08091","DOIUrl":"https://doi.org/arxiv-2409.08091","url":null,"abstract":"Zero-shot subject-driven image generation aims to produce images that\u0000incorporate a subject from a given example image. The challenge lies in\u0000preserving the subject's identity while aligning with the text prompt, which\u0000often requires modifying certain aspects of the subject's appearance. Despite\u0000advancements in diffusion model based methods, existing approaches still\u0000struggle to balance identity preservation with text prompt alignment. In this\u0000study, we conducted an in-depth investigation into this issue and uncovered key\u0000insights for achieving effective identity preservation while maintaining a\u0000strong balance. Our key findings include: (1) the design of the subject image\u0000encoder significantly impacts identity preservation quality, and (2) generating\u0000an initial layout is crucial for both text alignment and identity preservation.\u0000Building on these insights, we introduce a new approach called EZIGen, which\u0000employs two main strategies: a carefully crafted subject image Encoder based on\u0000the UNet architecture of the pretrained Stable Diffusion model to ensure\u0000high-quality identity transfer, following a process that decouples the guidance\u0000stages and iteratively refines the initial image layout. Through these\u0000strategies, EZIGen achieves state-of-the-art results on multiple subject-driven\u0000benchmarks with a unified model and 100 times less training data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}