A Comprehensive Survey on Deep Multimodal Learning with Missing Modality
Renjie Wu, Hu Wang, Hsiang-Ting Chen
During multimodal model training and inference, data samples may be missing certain modalities, owing to sensor limitations, cost constraints, privacy concerns, data loss, and temporal and spatial factors, which compromises model performance. This survey provides an overview of recent progress in Multimodal Learning with Missing Modality (MLMM), focusing on deep learning techniques. It is the first comprehensive survey to cover the historical background and the distinction between MLMM and standard multimodal learning setups, followed by a detailed analysis of current MLMM methods, applications, and datasets, and concluding with a discussion of challenges and potential future directions in the field.
{"title":"A Comprehensive Survey on Deep Multimodal Learning with Missing Modality","authors":"Renjie Wu, Hu Wang, Hsiang-Ting Chen","doi":"arxiv-2409.07825","DOIUrl":"https://doi.org/arxiv-2409.07825","url":null,"abstract":"During multimodal model training and reasoning, data samples may miss certain\u0000modalities and lead to compromised model performance due to sensor limitations,\u0000cost constraints, privacy concerns, data loss, and temporal and spatial\u0000factors. This survey provides an overview of recent progress in Multimodal\u0000Learning with Missing Modality (MLMM), focusing on deep learning techniques. It\u0000is the first comprehensive survey that covers the historical background and the\u0000distinction between MLMM and standard multimodal learning setups, followed by a\u0000detailed analysis of current MLMM methods, applications, and datasets,\u0000concluding with a discussion about challenges and potential future directions\u0000in the field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task-Augmented Cross-View Imputation Network for Partial Multi-View Incomplete Multi-Label Classification
Xiaohuan Lu, Lian Zhao, Wai Keung Wong, Jie Wen, Jiang Long, Wulin Xie
In real-world scenarios, multi-view multi-label learning often encounters the challenge of incomplete training data due to limitations in data collection and unreliable annotation processes. The absence of multi-view features impairs the comprehensive understanding of samples, omitting details that are crucial for classification. To address this issue, we present a task-augmented cross-view imputation network (TACVI-Net) for handling partial multi-view incomplete multi-label classification. Specifically, we employ a two-stage network to derive highly task-relevant features with which to recover the missing views. In the first stage, we leverage the information bottleneck theory to obtain a discriminative representation of each view by extracting task-relevant information through a view-specific encoder-classifier architecture. In the second stage, an autoencoder-based multi-view reconstruction network is used to extract a high-level semantic representation of the augmented features and recover the missing data, thereby aiding the final classification task. Extensive experiments on five datasets demonstrate that TACVI-Net outperforms other state-of-the-art methods.
{"title":"Task-Augmented Cross-View Imputation Network for Partial Multi-View Incomplete Multi-Label Classification","authors":"Xiaohuan Lu, Lian Zhao, Wai Keung Wong, Jie Wen, Jiang Long, Wulin Xie","doi":"arxiv-2409.07931","DOIUrl":"https://doi.org/arxiv-2409.07931","url":null,"abstract":"In real-world scenarios, multi-view multi-label learning often encounters the\u0000challenge of incomplete training data due to limitations in data collection and\u0000unreliable annotation processes. The absence of multi-view features impairs the\u0000comprehensive understanding of samples, omitting crucial details essential for\u0000classification. To address this issue, we present a task-augmented cross-view\u0000imputation network (TACVI-Net) for the purpose of handling partial multi-view\u0000incomplete multi-label classification. Specifically, we employ a two-stage\u0000network to derive highly task-relevant features to recover the missing views.\u0000In the first stage, we leverage the information bottleneck theory to obtain a\u0000discriminative representation of each view by extracting task-relevant\u0000information through a view-specific encoder-classifier architecture. In the\u0000second stage, an autoencoder based multi-view reconstruction network is\u0000utilized to extract high-level semantic representation of the augmented\u0000features and recover the missing data, thereby aiding the final classification\u0000task. Extensive experiments on five datasets demonstrate that our TACVI-Net\u0000outperforms other state-of-the-art methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations
Martin Thißen, Thi Ngoc Diep Tran, Ben Joel Schönbein, Ute Trapp, Barbara Esteve Ratsch, Beate Egner, Romana Piat, Elke Hergenröther
The examination of the musculoskeletal system in dogs is a challenging task in veterinary practice. In this work, a novel method has been developed that enables efficient documentation of a dog's condition through a visual representation. However, since this visual documentation is new, no training data exist for it. The objective of this work is therefore to mitigate the impact of data scarcity in order to develop an AI-based diagnostic support system. To this end, we investigate the potential of synthetic data that mimics realistic visual documentations of diseases for pre-training AI models, and we propose a method for generating such synthetic image data. Initially, a basic dataset containing three distinct classes is generated, followed by the creation of a more sophisticated dataset containing 36 different classes. Both datasets are used for the pre-training of an AI model. Subsequently, an evaluation dataset is created, consisting of 250 manually created visual documentations for five different diseases, along with a subset containing 25 examples. On the 25-example subset, the results demonstrate an improvement of approximately 10% in diagnostic accuracy when using generated synthetic images that mimic real-world visual documentations. However, these results do not hold for the larger evaluation dataset of 250 examples, indicating that the advantages of using synthetic data for pre-training an AI model emerge primarily when only a few examples of visual documentations are available for a given disease. Overall, this work provides valuable insights into mitigating the limitations imposed by limited training data through the strategic use of generated synthetic data, presenting an approach applicable beyond the canine musculoskeletal assessment domain.
{"title":"Enhancing Canine Musculoskeletal Diagnoses: Leveraging Synthetic Image Data for Pre-Training AI-Models on Visual Documentations","authors":"Martin Thißen, Thi Ngoc Diep Tran, Ben Joel Schönbein, Ute Trapp, Barbara Esteve Ratsch, Beate Egner, Romana Piat, Elke Hergenröther","doi":"arxiv-2409.08181","DOIUrl":"https://doi.org/arxiv-2409.08181","url":null,"abstract":"The examination of the musculoskeletal system in dogs is a challenging task\u0000in veterinary practice. In this work, a novel method has been developed that\u0000enables efficient documentation of a dog's condition through a visual\u0000representation. However, since the visual documentation is new, there is no\u0000existing training data. The objective of this work is therefore to mitigate the\u0000impact of data scarcity in order to develop an AI-based diagnostic support\u0000system. To this end, the potential of synthetic data that mimics realistic\u0000visual documentations of diseases for pre-training AI models is investigated.\u0000We propose a method for generating synthetic image data that mimics realistic\u0000visual documentations. Initially, a basic dataset containing three distinct\u0000classes is generated, followed by the creation of a more sophisticated dataset\u0000containing 36 different classes. Both datasets are used for the pre-training of\u0000an AI model. Subsequently, an evaluation dataset is created, consisting of 250\u0000manually created visual documentations for five different diseases. This\u0000dataset, along with a subset containing 25 examples. The obtained results on\u0000the evaluation dataset containing 25 examples demonstrate a significant\u0000enhancement of approximately 10% in diagnosis accuracy when utilizing generated\u0000synthetic images that mimic real-world visual documentations. However, these\u0000results do not hold true for the larger evaluation dataset containing 250\u0000examples, indicating that the advantages of using synthetic data for\u0000pre-training an AI model emerge primarily when dealing with few examples of\u0000visual documentations for a given disease. Overall, this work provides valuable\u0000insights into mitigating the limitations imposed by limited training data\u0000through the strategic use of generated synthetic data, presenting an approach\u0000applicable beyond the canine musculoskeletal assessment domain.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structured Pruning for Efficient Visual Place Recognition
Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan
Visual Place Recognition (VPR) is fundamental for the global re-localization of robots and devices, enabling them to recognize previously visited locations based on visual inputs. This capability is crucial for maintaining accurate mapping and localization over large areas. Given that VPR methods need to operate in real-time on embedded systems, it is critical to optimize them for minimal resource consumption. While the most efficient VPR approaches employ standard convolutional backbones with fixed descriptor dimensions, these often lead to redundancy in the embedding space as well as in the network architecture. Our work introduces a novel structured pruning method that not only streamlines common VPR architectures but also strategically removes redundancies within the feature embedding space. This dual focus significantly enhances the efficiency of the system, reducing both map and model memory requirements and decreasing feature extraction and retrieval latencies. Our approach reduces memory usage and latency by 21% and 16%, respectively, across models, while lowering recall@1 accuracy by less than 1%. These improvements enable real-time applications on edge devices with negligible accuracy loss.
{"title":"Structured Pruning for Efficient Visual Place Recognition","authors":"Oliver Grainge, Michael Milford, Indu Bodala, Sarvapali D. Ramchurn, Shoaib Ehsan","doi":"arxiv-2409.07834","DOIUrl":"https://doi.org/arxiv-2409.07834","url":null,"abstract":"Visual Place Recognition (VPR) is fundamental for the global re-localization\u0000of robots and devices, enabling them to recognize previously visited locations\u0000based on visual inputs. This capability is crucial for maintaining accurate\u0000mapping and localization over large areas. Given that VPR methods need to\u0000operate in real-time on embedded systems, it is critical to optimize these\u0000systems for minimal resource consumption. While the most efficient VPR\u0000approaches employ standard convolutional backbones with fixed descriptor\u0000dimensions, these often lead to redundancy in the embedding space as well as in\u0000the network architecture. Our work introduces a novel structured pruning\u0000method, to not only streamline common VPR architectures but also to\u0000strategically remove redundancies within the feature embedding space. This dual\u0000focus significantly enhances the efficiency of the system, reducing both map\u0000and model memory requirements and decreasing feature extraction and retrieval\u0000latencies. Our approach has reduced memory usage and latency by 21% and 16%,\u0000respectively, across models, while minimally impacting recall@1 accuracy by\u0000less than 1%. This significant improvement enhances real-time applications on\u0000edge devices with negligible accuracy loss.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SDformer: Efficient End-to-End Transformer for Depth Completion
Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang
Depth completion aims to predict dense depth maps from the sparse depth measurements of a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their excellent performance, they suffer from a limited receptive field. To overcome this drawback of CNNs, a more effective and powerful alternative has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. The standard Transformer, however, incurs a computational cost that grows quadratically with input resolution due to the key-query dot-product, which makes it ill-suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module for extracting and concatenating depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of computing self-attention over the whole feature maps, we apply different window sizes to extract long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
{"title":"SDformer: Efficient End-to-End Transformer for Depth Completion","authors":"Jian Qian, Miao Sun, Ashley Lee, Jie Li, Shenglong Zhuo, Patrick Yin Chiang","doi":"arxiv-2409.08159","DOIUrl":"https://doi.org/arxiv-2409.08159","url":null,"abstract":"Depth completion aims to predict dense depth maps with sparse depth\u0000measurements from a depth sensor. Currently, Convolutional Neural Network (CNN)\u0000based models are the most popular methods applied to depth completion tasks.\u0000However, despite the excellent high-end performance, they suffer from a limited\u0000representation area. To overcome the drawbacks of CNNs, a more effective and\u0000powerful method has been presented: the Transformer, which is an adaptive\u0000self-attention setting sequence-to-sequence model. While the standard\u0000Transformer quadratically increases the computational cost from the key-query\u0000dot-product of input resolution which improperly employs depth completion\u0000tasks. In this work, we propose a different window-based Transformer\u0000architecture for depth completion tasks named Sparse-to-Dense Transformer\u0000(SDformer). The network consists of an input module for the depth map and RGB\u0000image features extraction and concatenation, a U-shaped encoder-decoder\u0000Transformer for extracting deep features, and a refinement module.\u0000Specifically, we first concatenate the depth map features with the RGB image\u0000features through the input model. Then, instead of calculating self-attention\u0000with the whole feature maps, we apply different window sizes to extract the\u0000long-range depth dependencies. Finally, we refine the predicted features from\u0000the input module and the U-shaped encoder-decoder Transformer module to get the\u0000enriching depth features and employ a convolution layer to obtain the dense\u0000depth map. In practice, the SDformer obtains state-of-the-art results against\u0000the CNN-based depth completion models with lower computing loads and parameters\u0000on the NYU Depth V2 and KITTI DC datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LT3SD: Latent Trees for 3D Scene Diffusion
Quan Meng, Lei Li, Matthias Nießner, Angela Dai
We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but they are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation that effectively encodes both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes of varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion of partial scene observations.
{"title":"LT3SD: Latent Trees for 3D Scene Diffusion","authors":"Quan Meng, Lei Li, Matthias Nießner, Angela Dai","doi":"arxiv-2409.08215","DOIUrl":"https://doi.org/arxiv-2409.08215","url":null,"abstract":"We present LT3SD, a novel latent diffusion model for large-scale 3D scene\u0000generation. Recent advances in diffusion models have shown impressive results\u0000in 3D object generation, but are limited in spatial extent and quality when\u0000extended to 3D scenes. To generate complex and diverse 3D scene structures, we\u0000introduce a latent tree representation to effectively encode both\u0000lower-frequency geometry and higher-frequency detail in a coarse-to-fine\u0000hierarchy. We can then learn a generative diffusion process in this latent 3D\u0000scene space, modeling the latent components of a scene at each resolution\u0000level. To synthesize large-scale scenes with varying sizes, we train our\u0000diffusion model on scene patches and synthesize arbitrary-sized output 3D\u0000scenes through shared diffusion generation across multiple scene patches.\u0000Through extensive experiments, we demonstrate the efficacy and benefits of\u0000LT3SD for large-scale, high-quality unconditional 3D scene generation and for\u0000probabilistic completion for partial scene observations.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UNIT: Unsupervised Online Instance Segmentation through Time
Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit
Online object segmentation and tracking in Lidar point clouds enables autonomous agents to understand their surroundings and make safe decisions. Unfortunately, manual annotations for these tasks are prohibitively costly. We tackle this problem with the task of class-agnostic unsupervised online instance segmentation and tracking. To that end, we leverage an instance segmentation backbone and propose a new training recipe that enables the online tracking of objects. Our network is trained on pseudo-labels, eliminating the need for manual annotations. We conduct an evaluation using metrics adapted for temporal instance segmentation. Computing these metrics requires temporally-consistent instance labels. When unavailable, we construct these labels using the available 3D bounding boxes and semantic labels in the dataset. We compare our method against strong baselines and demonstrate its superiority across two different outdoor Lidar datasets.
{"title":"UNIT: Unsupervised Online Instance Segmentation through Time","authors":"Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit","doi":"arxiv-2409.07887","DOIUrl":"https://doi.org/arxiv-2409.07887","url":null,"abstract":"Online object segmentation and tracking in Lidar point clouds enables\u0000autonomous agents to understand their surroundings and make safe decisions.\u0000Unfortunately, manual annotations for these tasks are prohibitively costly. We\u0000tackle this problem with the task of class-agnostic unsupervised online\u0000instance segmentation and tracking. To that end, we leverage an instance\u0000segmentation backbone and propose a new training recipe that enables the online\u0000tracking of objects. Our network is trained on pseudo-labels, eliminating the\u0000need for manual annotations. We conduct an evaluation using metrics adapted for\u0000temporal instance segmentation. Computing these metrics requires\u0000temporally-consistent instance labels. When unavailable, we construct these\u0000labels using the available 3D bounding boxes and semantic labels in the\u0000dataset. We compare our method against strong baselines and demonstrate its\u0000superiority across two different outdoor Lidar datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes
Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai
RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies pay insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly affects the attention representation, leading to prediction errors caused by attention-shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. First, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Second, the similarity in the RGB-D feature space is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP decoder is used to effectively fuse multi-scale features and meet real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.
{"title":"Depth Matters: Exploring Deep Interactions of RGB-D for Semantic Segmentation in Traffic Scenes","authors":"Siyu Chen, Ting Han, Changshe Zhang, Weiquan Liu, Jinhe Su, Zongyue Wang, Guorong Cai","doi":"arxiv-2409.07995","DOIUrl":"https://doi.org/arxiv-2409.07995","url":null,"abstract":"RGB-D has gradually become a crucial data source for understanding complex\u0000scenes in assisted driving. However, existing studies have paid insufficient\u0000attention to the intrinsic spatial properties of depth maps. This oversight\u0000significantly impacts the attention representation, leading to prediction\u0000errors caused by attention shift issues. To this end, we propose a novel\u0000learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the\u0000effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization\u0000(Depth SAO) as offset to represent real-world spatial relationships. Secondly,\u0000the similarity in the feature space of RGB-D is learned by Depth Linear\u0000Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level.\u0000Finally, an MLP Decoder is utilized to effectively fuse multi-scale features\u0000for meeting real-time requirements. Comprehensive experiments demonstrate that\u0000the proposed DiPFormer significantly addresses the issue of attention\u0000misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% /\u0000+1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI\u0000(97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes\u0000(83.4% mIoU) datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"433 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Cost Tree Crown Dieback Estimation Using Deep Learning-Based Segmentation
M. J. Allen, D. Moreno-Fernández, P. Ruiz-Benito, S. W. D. Grieve, E. R. Lines
The global increase in observed forest dieback, characterised by the death of tree foliage, heralds widespread decline in forest ecosystems. This degradation causes significant changes to ecosystem services and functions, including habitat provision and carbon sequestration, which can be difficult to detect using traditional monitoring techniques, highlighting the need for large-scale and high-frequency monitoring. Contemporary developments in the instruments and methods used to gather and process data at large scales mean this monitoring is now possible. In particular, the advancement of low-cost drone technology and deep learning on consumer-level hardware provide new opportunities. Here, we use an approach based on deep learning and vegetation indices to assess crown dieback from RGB aerial data without the need for expensive instrumentation such as LiDAR. We use an iterative approach to match crown footprints predicted by deep learning with field-based inventory data from a Mediterranean ecosystem exhibiting drought-induced dieback, and compare expert field-based crown dieback estimation with vegetation-index-based estimates. We obtain high overall segmentation accuracy (mAP: 0.519) without additional technical development of the underlying Mask R-CNN model, underscoring the potential of these approaches for non-expert use and proving their applicability to real-world conservation. We also find that colour-coordinate-based estimates of dieback correlate well with expert field-based estimation. Substituting Mask R-CNN predictions for ground-truth crown footprints had a negligible impact on dieback estimates, indicating robustness. Our findings demonstrate the potential of automated data collection and processing, including the application of deep learning, to improve the coverage, speed and cost of forest dieback monitoring.
{"title":"Low-Cost Tree Crown Dieback Estimation Using Deep Learning-Based Segmentation","authors":"M. J. Allen, D. Moreno-Fernández, P. Ruiz-Benito, S. W. D. Grieve, E. R. Lines","doi":"arxiv-2409.08171","DOIUrl":"https://doi.org/arxiv-2409.08171","url":null,"abstract":"The global increase in observed forest dieback, characterised by the death of\u0000tree foliage, heralds widespread decline in forest ecosystems. This degradation\u0000causes significant changes to ecosystem services and functions, including\u0000habitat provision and carbon sequestration, which can be difficult to detect\u0000using traditional monitoring techniques, highlighting the need for large-scale\u0000and high-frequency monitoring. Contemporary developments in the instruments and\u0000methods to gather and process data at large-scales mean this monitoring is now\u0000possible. In particular, the advancement of low-cost drone technology and deep\u0000learning on consumer-level hardware provide new opportunities. Here, we use an\u0000approach based on deep learning and vegetation indices to assess crown dieback\u0000from RGB aerial data without the need for expensive instrumentation such as\u0000LiDAR. We use an iterative approach to match crown footprints predicted by deep\u0000learning with field-based inventory data from a Mediterranean ecosystem\u0000exhibiting drought-induced dieback, and compare expert field-based crown\u0000dieback estimation with vegetation index-based estimates. We obtain high\u0000overall segmentation accuracy (mAP: 0.519) without the need for additional\u0000technical development of the underlying Mask R-CNN model, underscoring the\u0000potential of these approaches for non-expert use and proving their\u0000applicability to real-world conservation. We also find colour-coordinate based\u0000estimates of dieback correlate well with expert field-based estimation.\u0000Substituting ground truth for Mask R-CNN model predictions showed negligible\u0000impact on dieback estimates, indicating robustness. Our findings demonstrate\u0000the potential of automated data collection and processing, including the\u0000application of deep learning, to improve the coverage, speed and cost of forest\u0000dieback monitoring.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"112 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance
Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu
Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance. Despite advancements in diffusion-model-based methods, existing approaches still struggle to balance identity preservation with text-prompt alignment. In this study, we conduct an in-depth investigation of this issue and uncover key insights for achieving effective identity preservation while maintaining this balance. Our key findings are: (1) the design of the subject image encoder significantly impacts identity-preservation quality, and (2) generating an initial layout is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: a carefully crafted subject image encoder, based on the UNet architecture of the pretrained Stable Diffusion model, to ensure high-quality identity transfer, and a generation process that decouples the guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data.
{"title":"EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance","authors":"Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu","doi":"arxiv-2409.08091","DOIUrl":"https://doi.org/arxiv-2409.08091","url":null,"abstract":"Zero-shot subject-driven image generation aims to produce images that\u0000incorporate a subject from a given example image. The challenge lies in\u0000preserving the subject's identity while aligning with the text prompt, which\u0000often requires modifying certain aspects of the subject's appearance. Despite\u0000advancements in diffusion model based methods, existing approaches still\u0000struggle to balance identity preservation with text prompt alignment. In this\u0000study, we conducted an in-depth investigation into this issue and uncovered key\u0000insights for achieving effective identity preservation while maintaining a\u0000strong balance. Our key findings include: (1) the design of the subject image\u0000encoder significantly impacts identity preservation quality, and (2) generating\u0000an initial layout is crucial for both text alignment and identity preservation.\u0000Building on these insights, we introduce a new approach called EZIGen, which\u0000employs two main strategies: a carefully crafted subject image Encoder based on\u0000the UNet architecture of the pretrained Stable Diffusion model to ensure\u0000high-quality identity transfer, following a process that decouples the guidance\u0000stages and iteratively refines the initial image layout. Through these\u0000strategies, EZIGen achieves state-of-the-art results on multiple subject-driven\u0000benchmarks with a unified model and 100 times less training data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}