Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few-shot semantic segmentation is one approach that enables deep learning models to learn from only a few labelled examples of novel classes not seen during training. The generalized few-shot segmentation setting poses an additional challenge: models must not only adapt to the novel classes but also maintain strong performance on the base classes seen during training. While previous datasets and benchmarks have addressed the few-shot segmentation setting in remote sensing, we are the first to propose a generalized few-shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which makes it important to explore within the remote sensing context. We release a dataset that augments OpenEarthMap with additional classes labelled for the generalized few-shot evaluation setting. The dataset was released as part of the OpenEarthMap land cover mapping generalized few-shot challenge at the L3D-IVU workshop, held in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details and provide benchmark results for the two phases of the challenge on the validation and test sets.
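To make the evaluation protocol concrete, here is a minimal sketch of how generalized few-shot segmentation is typically scored: per-class IoU is computed from a confusion matrix and averaged separately over base and novel classes, optionally combined into a single weighted score. The class counts and the 0.4/0.6 weighting below are illustrative assumptions, not necessarily the official challenge metric.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a confusion matrix from flattened prediction/label arrays."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask],
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

def generalized_fss_scores(pred, gt, base_ids, novel_ids):
    """Per-class IoU, then mIoU over base classes, novel classes, and a weighted
    combination (the 0.4/0.6 weighting is an illustrative assumption)."""
    num_classes = len(base_ids) + len(novel_ids)
    cm = confusion_matrix(pred, gt, num_classes)
    tp = np.diag(cm)
    iou = tp / np.maximum(cm.sum(0) + cm.sum(1) - tp, 1)
    base_miou = iou[base_ids].mean()
    novel_miou = iou[novel_ids].mean()
    return base_miou, novel_miou, 0.4 * base_miou + 0.6 * novel_miou

# Toy usage with random predictions over 7 base + 4 novel classes (counts are illustrative).
rng = np.random.default_rng(0)
gt = rng.integers(0, 11, size=10_000)
pred = rng.integers(0, 11, size=10_000)
print(generalized_fss_scores(pred, gt, base_ids=list(range(7)), novel_ids=list(range(7, 11))))
```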
{"title":"Generalized Few-Shot Semantic Segmentation in Remote Sensing: Challenge and Benchmark","authors":"Clifford Broni-Bediako, Junshi Xia, Jian Song, Hongruixuan Chen, Mennatullah Siam, Naoto Yokoya","doi":"arxiv-2409.11227","DOIUrl":"https://doi.org/arxiv-2409.11227","url":null,"abstract":"Learning with limited labelled data is a challenging problem in various\u0000applications, including remote sensing. Few-shot semantic segmentation is one\u0000approach that can encourage deep learning models to learn from few labelled\u0000examples for novel classes not seen during the training. The generalized\u0000few-shot segmentation setting has an additional challenge which encourages\u0000models not only to adapt to the novel classes but also to maintain strong\u0000performance on the training base classes. While previous datasets and\u0000benchmarks discussed the few-shot segmentation setting in remote sensing, we\u0000are the first to propose a generalized few-shot segmentation benchmark for\u0000remote sensing. The generalized setting is more realistic and challenging,\u0000which necessitates exploring it within the remote sensing context. We release\u0000the dataset augmenting OpenEarthMap with additional classes labelled for the\u0000generalized few-shot evaluation setting. The dataset is released during the\u0000OpenEarthMap land cover mapping generalized few-shot challenge in the L3D-IVU\u0000workshop in conjunction with CVPR 2024. In this work, we summarize the dataset\u0000and challenge details in addition to providing the benchmark results on the two\u0000phases of the challenge for the validation and test sets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current state-of-the-art (SotA) image super-resolution (ISR) methods employ deep learning (DL) techniques trained on large amounts of image data. The primary limitation to extending existing SotA ISR methods to real-world instances is their computational and time complexity. In this paper, in contrast to existing methods, we present a novel and computationally efficient ISR algorithm that learns the ISR task independently of any image dataset. The proposed algorithm reformulates the ISR task from generating super-resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, which exploits the identity relation between the degradation and inverse-degradation models. The proposed approach relies neither on an ISR dataset nor on a single input low-resolution (LR) image (unlike self-supervised methods such as ZSSR) to model the ISR task. Hence we term our model Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires at least an order of magnitude fewer computational resources and demonstrates competitive performance on benchmark ISR datasets. Another salient aspect of our proposition is that the NSSR-DIL framework avoids retraining and remains unchanged across scale factors such as x2, x3, and x4. This makes our highly efficient ISR model more suitable for real-world applications.
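As we read it, the core of the identity objective is image-free: the composition of a degradation kernel with the learned inverse kernel should approximate the identity (a delta kernel). The sketch below illustrates that idea with a single Gaussian blur kernel and plain gradient descent; the kernel sizes, the single-kernel setup, and the optimizer are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative degradation: a 7x7 Gaussian blur kernel (an assumption; the paper
# considers a space of degradation kernels rather than a single one).
ax = torch.arange(7) - 3
g = torch.exp(-(ax ** 2) / (2 * 1.5 ** 2))
k = (g[:, None] * g[None, :])
k = (k / k.sum()).view(1, 1, 7, 7)

# Learnable inverse kernel of size 13x13.
k_inv = torch.zeros(1, 1, 13, 13, requires_grad=True)

# Target: a delta kernel of the size of the full composition (7 + 13 - 1 = 19).
delta = torch.zeros(1, 1, 19, 19)
delta[0, 0, 9, 9] = 1.0

opt = torch.optim.Adam([k_inv], lr=1e-2)
for step in range(2000):
    # Full convolution of k with k_inv via zero padding.
    comp = F.conv2d(F.pad(k, (12, 12, 12, 12)), k_inv)
    loss = F.mse_loss(comp, delta)  # identity objective: k composed with k_inv approximates delta
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```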
{"title":"NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning","authors":"Sree Rama Vamsidhar S, Rama Krishna Gorthi","doi":"arxiv-2409.12165","DOIUrl":"https://doi.org/arxiv-2409.12165","url":null,"abstract":"The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods\u0000employ Deep Learning (DL) techniques using a large amount of image data. The\u0000primary limitation to extending the existing SotA ISR works for real-world\u0000instances is their computational and time complexities. In this paper, contrary\u0000to the existing methods, we present a novel and computationally efficient ISR\u0000algorithm that is independent of the image dataset to learn the ISR task. The\u0000proposed algorithm reformulates the ISR task from generating the Super-Resolved\u0000(SR) images to computing the inverse of the kernels that span the degradation\u0000space. We introduce Deep Identity Learning, exploiting the identity relation\u0000between the degradation and inverse degradation models. The proposed approach\u0000neither relies on the ISR dataset nor on a single input low-resolution (LR)\u0000image (like the self-supervised method i.e. ZSSR) to model the ISR task. Hence\u0000we term our model as Null-Shot Super-Resolution Using Deep Identity Learning\u0000(NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources,\u0000at least by an order of 10, and demonstrates a competitive performance on\u0000benchmark ISR datasets. Another salient aspect of our proposition is that the\u0000NSSR-DIL framework detours retraining the model and remains the same for\u0000varying scale factors like X2, X3, and X4. This makes our highly efficient ISR\u0000model more suitable for real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind in our conference work, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4768 3D objects. The dataset comprises two components: fMRI-Shape, previously introduced and accessible at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse, proposed in this paper and available at https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the Core set in fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Additionally, we propose MinD-3D, a novel framework designed to decode 3D visual information from fMRI signals. The framework first extracts and aggregates features from fMRI data using a neuro-fusion encoder, then employs a feature-bridge diffusion model to generate visual features, and finally reconstructs the 3D object using a generative transformer decoder. We establish new benchmarks by designing metrics at both semantic and structural levels to evaluate model performance. Furthermore, we assess our model's effectiveness in an Out-of-Distribution setting and analyze the attribution of the extracted features and the visual ROIs in fMRI signals. Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with high semantic and spatial accuracy but also deepens our understanding of how the human brain processes 3D visual information. Project page at: https://jianxgao.github.io/MinD-3D.
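The abstract describes a three-stage pipeline (neuro-fusion encoder, feature-bridge diffusion model, generative transformer decoder). The toy sketch below only illustrates that data flow with stand-in modules; all module internals, dimensions, and the single-step "bridge" are assumptions and not the authors' architecture.

```python
import torch
import torch.nn as nn

class NeuroFusionEncoder(nn.Module):
    """Stand-in: aggregate per-frame fMRI features into one brain feature vector."""
    def __init__(self, fmri_dim=4096, d=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(fmri_dim, d), nn.GELU(), nn.Linear(d, d))
    def forward(self, fmri_frames):            # (B, T, fmri_dim)
        return self.net(fmri_frames).mean(1)   # aggregate frames -> (B, d)

class FeatureBridge(nn.Module):
    """Placeholder for the feature-bridge diffusion model: maps brain features to
    'visual' features in one deterministic step instead of iterative denoising."""
    def __init__(self, d=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
    def forward(self, brain_feat):
        return self.net(brain_feat)

class ShapeDecoder(nn.Module):
    """Transformer decoder emitting a sequence of discrete 3D shape tokens."""
    def __init__(self, d=256, vocab=1024, seq_len=64):
        super().__init__()
        self.query = nn.Parameter(torch.randn(seq_len, d))
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)
    def forward(self, vis_feat):                          # (B, d)
        q = self.query.unsqueeze(0).expand(vis_feat.size(0), -1, -1)
        h = self.decoder(q, vis_feat.unsqueeze(1))        # cross-attend to visual features
        return self.head(h).argmax(-1)                    # (B, seq_len) token ids

fmri = torch.randn(2, 8, 4096)                            # 2 samples, 8 fMRI frames each
tokens = ShapeDecoder()(FeatureBridge()(NeuroFusionEncoder()(fmri)))
print(tokens.shape)                                       # torch.Size([2, 64])
```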
{"title":"fMRI-3D: A Comprehensive Dataset for Enhancing fMRI-based 3D Reconstruction","authors":"Jianxiong Gao, Yuqian Fu, Yun Wang, Xuelin Qian, Jianfeng Feng, Yanwei Fu","doi":"arxiv-2409.11315","DOIUrl":"https://doi.org/arxiv-2409.11315","url":null,"abstract":"Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI)\u0000data, introduced as Recon3DMind in our conference work, is of significant\u0000interest to both cognitive neuroscience and computer vision. To advance this\u0000task, we present the fMRI-3D dataset, which includes data from 15 participants\u0000and showcases a total of 4768 3D objects. The dataset comprises two components:\u0000fMRI-Shape, previously introduced and accessible at\u0000https://huggingface.co/datasets/Fudan-fMRI/fMRI-Shape, and fMRI-Objaverse,\u0000proposed in this paper and available at\u0000https://huggingface.co/datasets/Fudan-fMRI/fMRI-Objaverse. fMRI-Objaverse\u0000includes data from 5 subjects, 4 of whom are also part of the Core set in\u0000fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories,\u0000all accompanied by text captions. This significantly enhances the diversity and\u0000potential applications of the dataset. Additionally, we propose MinD-3D, a\u0000novel framework designed to decode 3D visual information from fMRI signals. The\u0000framework first extracts and aggregates features from fMRI data using a\u0000neuro-fusion encoder, then employs a feature-bridge diffusion model to generate\u0000visual features, and finally reconstructs the 3D object using a generative\u0000transformer decoder. We establish new benchmarks by designing metrics at both\u0000semantic and structural levels to evaluate model performance. Furthermore, we\u0000assess our model's effectiveness in an Out-of-Distribution setting and analyze\u0000the attribution of the extracted features and the visual ROIs in fMRI signals.\u0000Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with\u0000high semantic and spatial accuracy but also deepens our understanding of how\u0000human brain processes 3D visual information. Project page at:\u0000https://jianxgao.github.io/MinD-3D.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau
In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.
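One way to picture the meta-ControlNet idea is a small network that predicts how strongly reference-derived features are injected into the diffusion backbone, instead of using a fixed ControlNet scale. The sketch below is hypothetical: the controller inputs (normalized timestep and an alignment score) and the fusion rule are assumptions, not the Phidias design.

```python
import torch
import torch.nn as nn

class MetaController(nn.Module):
    """Toy stand-in for dynamic conditioning-strength modulation: predicts a scalar
    weight in [0, 1] from the diffusion timestep and a crude image/reference
    alignment score (both inputs are assumptions)."""
    def __init__(self, d=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, d), nn.SiLU(), nn.Linear(d, 1), nn.Sigmoid())
    def forward(self, t_norm, alignment):
        return self.net(torch.stack([t_norm, alignment], dim=-1))   # (B, 1)

def fuse(unet_residual, control_residual, strength):
    # Reference guidance is injected with a per-sample, dynamically predicted weight.
    return unet_residual + strength.view(-1, 1, 1, 1) * control_residual

B = 2
strength = MetaController()(torch.rand(B), torch.rand(B))
out = fuse(torch.randn(B, 320, 32, 32), torch.randn(B, 320, 32, 32), strength)
print(strength.squeeze(-1), out.shape)
```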
{"title":"Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion","authors":"Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau","doi":"arxiv-2409.11406","DOIUrl":"https://doi.org/arxiv-2409.11406","url":null,"abstract":"In 3D modeling, designers often use an existing 3D model as a reference to\u0000create new ones. This practice has inspired the development of Phidias, a novel\u0000generative model that uses diffusion for reference-augmented 3D generation.\u0000Given an image, our method leverages a retrieved or user-provided 3D reference\u0000model to guide the generation process, thereby enhancing the generation\u0000quality, generalization ability, and controllability. Our model integrates\u0000three key components: 1) meta-ControlNet that dynamically modulates the\u0000conditioning strength, 2) dynamic reference routing that mitigates misalignment\u0000between the input image and 3D reference, and 3) self-reference augmentations\u0000that enable self-supervised training with a progressive curriculum.\u0000Collectively, these designs result in a clear improvement over existing\u0000methods. Phidias establishes a unified framework for 3D generation using text,\u0000image, and 3D conditions with versatile applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"191 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yichen Zhang, Zihan Wang, Jiali Han, Peilin Li, Jiaxun Zhang, Jianqiang Wang, Lei He, Keqiang Li
3D Gaussian Splatting (3DGS) integrates the strengths of primitive-based representations and volumetric rendering techniques, enabling real-time, high-quality rendering. However, 3DGS models typically overfit to single-scene training and are highly sensitive to the initialization of Gaussian ellipsoids, heuristically derived from Structure from Motion (SfM) point clouds, which limits both generalization and practicality. To address these limitations, we propose GS-Net, a generalizable, plug-and-play 3DGS module that densifies Gaussian ellipsoids from sparse SfM point clouds, enhancing geometric structure representation. To the best of our knowledge, GS-Net is the first plug-and-play 3DGS module with cross-scene generalization capabilities. Additionally, we introduce the CARLA-NVS dataset, which incorporates additional camera viewpoints to thoroughly evaluate reconstruction and rendering quality. Extensive experiments demonstrate that applying GS-Net to 3DGS yields a PSNR improvement of 2.08 dB for conventional viewpoints and 1.86 dB for novel viewpoints, confirming the method's effectiveness and robustness.
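The densification step can be pictured as a per-point network that spawns several new Gaussian primitives around each sparse SfM point. The sketch below is an illustrative stand-in, not the GS-Net architecture: the input features, the number of spawned Gaussians, and the parameterization are assumptions.

```python
import torch
import torch.nn as nn

class GaussianDensifier(nn.Module):
    """Toy sketch: each sparse SfM point spawns K new Gaussian primitives
    (position offset, scale, rotation, opacity). K=4 and the per-point MLP are
    illustrative assumptions."""
    def __init__(self, k=4, d=128):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(6, d), nn.ReLU(),          # xyz + rgb per SfM point
            nn.Linear(d, k * (3 + 3 + 4 + 1)),   # offset, log-scale, quaternion, opacity
        )
    def forward(self, pts):                       # (N, 6)
        raw = self.mlp(pts).view(-1, self.k, 11)
        offset, log_scale, quat, opacity = raw.split([3, 3, 4, 1], dim=-1)
        return {
            "xyz": pts[:, None, :3] + 0.05 * torch.tanh(offset),   # stay near the seed point
            "scale": log_scale.exp(),
            "rot": quat / quat.norm(dim=-1, keepdim=True).clamp_min(1e-6),
            "opacity": torch.sigmoid(opacity),
        }

sparse = torch.randn(1000, 6)                    # sparse SfM points with color
dense = GaussianDensifier()(sparse)
print({name: v.shape for name, v in dense.items()})   # each (1000, 4, ...)
```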
{"title":"GS-Net: Generalizable Plug-and-Play 3D Gaussian Splatting Module","authors":"Yichen Zhang, Zihan Wang, Jiali Han, Peilin Li, Jiaxun Zhang, Jianqiang Wang, Lei He, Keqiang Li","doi":"arxiv-2409.11307","DOIUrl":"https://doi.org/arxiv-2409.11307","url":null,"abstract":"3D Gaussian Splatting (3DGS) integrates the strengths of primitive-based\u0000representations and volumetric rendering techniques, enabling real-time,\u0000high-quality rendering. However, 3DGS models typically overfit to single-scene\u0000training and are highly sensitive to the initialization of Gaussian ellipsoids,\u0000heuristically derived from Structure from Motion (SfM) point clouds, which\u0000limits both generalization and practicality. To address these limitations, we\u0000propose GS-Net, a generalizable, plug-and-play 3DGS module that densifies\u0000Gaussian ellipsoids from sparse SfM point clouds, enhancing geometric structure\u0000representation. To the best of our knowledge, GS-Net is the first plug-and-play\u00003DGS module with cross-scene generalization capabilities. Additionally, we\u0000introduce the CARLA-NVS dataset, which incorporates additional camera\u0000viewpoints to thoroughly evaluate reconstruction and rendering quality.\u0000Extensive experiments demonstrate that applying GS-Net to 3DGS yields a PSNR\u0000improvement of 2.08 dB for conventional viewpoints and 1.86 dB for novel\u0000viewpoints, confirming the method's effectiveness and robustness.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma
End-to-end autonomous driving with vision only is not only more cost-effective than LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework that generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ module, encodes the labels with AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying AM-VAE to encode air and non-air separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning with the autoregressive world model.
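The AM-VAE component is described as encoding air and non-air separately. A hypothetical reading of that split is sketched below: the semantic occupancy grid is divided into an air channel and the remaining occupied classes, each passed through its own encoder before the latents are concatenated. The layer configuration and the omission of the decoder and KL terms are simplifications, not the paper's design.

```python
import torch
import torch.nn as nn

class AMVAE(nn.Module):
    """Toy sketch of separate air / non-air encoding for a semantic occupancy grid."""
    def __init__(self, num_classes=16, d=32):
        super().__init__()
        def enc(in_ch):
            return nn.Sequential(
                nn.Conv3d(in_ch, d, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv3d(d, d, 3, stride=2, padding=1),
            )
        self.air_enc = enc(1)
        self.nonair_enc = enc(num_classes)
    def forward(self, occ):                       # (B, C, X, Y, Z) one-hot semantics, channel 0 = air
        air = occ[:, :1]
        nonair = occ[:, 1:]
        return torch.cat([self.air_enc(air), self.nonair_enc(nonair)], dim=1)

occ = torch.zeros(1, 17, 32, 32, 8)
occ[:, 0] = 1.0                                   # everything is air in this toy grid
latent = AMVAE(num_classes=16)(occ)
print(latent.shape)                               # torch.Size([1, 64, 8, 8, 2])
```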
{"title":"RenderWorld: World Model with Self-Supervised 3D Label","authors":"Ziyang Yan, Wenzhen Dong, Yihua Shao, Yuhang Lu, Liu Haiyang, Jingwen Liu, Haozhe Wang, Zhe Wang, Yan Wang, Fabio Remondino, Yuexin Ma","doi":"arxiv-2409.11356","DOIUrl":"https://doi.org/arxiv-2409.11356","url":null,"abstract":"End-to-end autonomous driving with vision-only is not only more\u0000cost-effective compared to LiDAR-vision fusion but also more reliable than\u0000traditional methods. To achieve a economical and robust purely visual\u0000autonomous driving system, we propose RenderWorld, a vision-only end-to-end\u0000autonomous driving framework, which generates 3D occupancy labels using a\u0000self-supervised gaussian-based Img2Occ Module, then encodes the labels by\u0000AM-VAE, and uses world model for forecasting and planning. RenderWorld employs\u0000Gaussian Splatting to represent 3D scenes and render 2D images greatly improves\u0000segmentation accuracy and reduces GPU memory consumption compared with\u0000NeRF-based methods. By applying AM-VAE to encode air and non-air separately,\u0000RenderWorld achieves more fine-grained scene element representation, leading to\u0000state-of-the-art performance in both 4D occupancy forecasting and motion\u0000planning from autoregressive world model.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi-scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve state-of-the-art results on benchmark datasets such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. Code is available at https://github.com/amirrezafateh/MSDNet.
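For context, a common prototyping recipe in few-shot segmentation, which a multi-scale decoder like the one described above would then refine, is masked average pooling of support features followed by cosine-similarity priors at several resolutions. The sketch below shows that generic recipe; it is not claimed to match MSDNet's transformer-guided prototyping exactly, and the feature sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Support prototype via masked average pooling, a standard few-shot
    segmentation ingredient used here as an illustrative stand-in."""
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    return (feat * mask).sum(dim=(-2, -1)) / mask.sum(dim=(-2, -1)).clamp_min(1e-6)

def prior_maps(query_feats, prototype):
    """Cosine-similarity priors at several resolutions, to be refined by a
    hierarchical (multi-scale) decoder."""
    maps = []
    for f in query_feats:                                   # coarse-to-fine feature maps
        sim = F.cosine_similarity(f, prototype[..., None, None], dim=1, eps=1e-6)
        maps.append(sim.unsqueeze(1))                        # (B, 1, H, W)
    return maps

# Toy tensors: one support/query pair, 256-dim features at 3 scales.
support_feat = torch.randn(1, 256, 32, 32)
support_mask = torch.randint(0, 2, (1, 1, 128, 128)).float()
query_feats = [torch.randn(1, 256, s, s) for s in (16, 32, 64)]

proto = masked_average_pooling(support_feat, support_mask)   # (1, 256)
for m in prior_maps(query_feats, proto):
    print(m.shape)
```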
{"title":"MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping","authors":"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh","doi":"arxiv-2409.11316","DOIUrl":"https://doi.org/arxiv-2409.11316","url":null,"abstract":"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\u0000in query images with only a handful of annotated examples. However, many\u0000previous state-of-the-art methods either have to discard intricate local\u0000semantic features or suffer from high computational complexity. To address\u0000these challenges, we propose a new Few-shot Semantic Segmentation framework\u0000based on the transformer architecture. Our approach introduces the spatial\u0000transformer decoder and the contextual mask generation module to improve the\u0000relational understanding between support and query images. Moreover, we\u0000introduce a multi-scale decoder to refine the segmentation mask by\u0000incorporating features from different resolutions in a hierarchical manner.\u0000Additionally, our approach integrates global features from intermediate encoder\u0000stages to improve contextual understanding, while maintaining a lightweight\u0000structure to reduce complexity. This balance between performance and efficiency\u0000enables our method to achieve state-of-the-art results on benchmark datasets\u0000such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\u0000Notably, our model with only 1.5 million parameters demonstrates competitive\u0000performance while overcoming limitations of existing methodologies.\u0000https://github.com/amirrezafateh/MSDNet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edgar Heinert, Stephan Tilgner, Timo Palm, Matthias Rottmann
When employing deep neural networks (DNNs) for semantic segmentation in safety-critical applications such as automotive perception or medical imaging, it is important to estimate their performance at runtime, e.g. via uncertainty estimates or prediction quality estimates. Previous works mostly performed uncertainty estimation at the pixel level. One line of research takes a connected-component-wise (segment-wise) perspective, approaching uncertainty estimation at the object level by performing so-called meta classification and meta regression to estimate uncertainty and prediction quality, respectively. In those works, each predicted segment is considered individually when estimating its uncertainty or prediction quality. However, neighboring segments may provide additional hints on whether a given predicted segment is of high quality, which we study in the present work. On the basis of uncertainty-indicating metrics at the segment level, we use graph neural networks (GNNs) to model a given segment's quality as a function of its own metrics and those of its neighboring segments. We compare different GNN architectures and achieve a notable performance improvement.
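The segment-graph idea can be made concrete as follows: each predicted segment is a node whose features are its uncertainty metrics, edges connect adjacent segments, and a GNN regresses the segment's prediction quality (e.g., its IoU with the ground truth). The sketch below uses a simple mean-neighbor aggregation layer as a stand-in for the GNN architectures compared in the paper; the feature count and graph are toy values.

```python
import torch
import torch.nn as nn

class SegmentQualityGNN(nn.Module):
    """Minimal sketch of segment-wise meta-regression with neighborhood context:
    nodes are predicted segments, node features are per-segment uncertainty
    metrics, edges connect adjacent segments, and the model predicts each
    segment's quality score."""
    def __init__(self, in_dim, hidden=64, layers=2):
        super().__init__()
        dims = [in_dim] + [hidden] * layers
        self.layers = nn.ModuleList(
            nn.Linear(2 * dims[i], dims[i + 1]) for i in range(layers)
        )
        self.head = nn.Linear(hidden, 1)
    def forward(self, x, adj):                  # x: (N, F), adj: (N, N) 0/1 adjacency
        deg = adj.sum(1, keepdim=True).clamp_min(1.0)
        for layer in self.layers:
            neigh = adj @ x / deg               # mean over neighboring segments
            x = torch.relu(layer(torch.cat([x, neigh], dim=-1)))
        return self.head(x).sigmoid().squeeze(-1)   # predicted quality per segment

# Toy graph: 5 segments, 8 uncertainty metrics each, chain adjacency.
x = torch.rand(5, 8)
adj = torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
print(SegmentQualityGNN(in_dim=8)(x, adj))
```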
{"title":"Uncertainty and Prediction Quality Estimation for Semantic Segmentation via Graph Neural Networks","authors":"Edgar Heinert, Stephan Tilgner, Timo Palm, Matthias Rottmann","doi":"arxiv-2409.11373","DOIUrl":"https://doi.org/arxiv-2409.11373","url":null,"abstract":"When employing deep neural networks (DNNs) for semantic segmentation in\u0000safety-critical applications like automotive perception or medical imaging, it\u0000is important to estimate their performance at runtime, e.g. via uncertainty\u0000estimates or prediction quality estimates. Previous works mostly performed\u0000uncertainty estimation on pixel-level. In a line of research, a\u0000connected-component-wise (segment-wise) perspective was taken, approaching\u0000uncertainty estimation on an object-level by performing so-called meta\u0000classification and regression to estimate uncertainty and prediction quality,\u0000respectively. In those works, each predicted segment is considered individually\u0000to estimate its uncertainty or prediction quality. However, the neighboring\u0000segments may provide additional hints on whether a given predicted segment is\u0000of high quality, which we study in the present work. On the basis of\u0000uncertainty indicating metrics on segment-level, we use graph neural networks\u0000(GNNs) to model the relationship of a given segment's quality as a function of\u0000the given segment's metrics as well as those of its neighboring segments. We\u0000compare different GNN architectures and achieve a notable performance\u0000improvement.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200$\times$ faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
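The speed argument boils down to the number of network evaluations: a conventional multi-step denoising loop calls the model dozens of times per image, whereas the single-step protocol that end-to-end fine-tuning builds on calls it once. The toy sketch below only contrasts those two call patterns with a dummy denoiser; it does not reproduce the specific scheduler flaw identified in the paper.

```python
import torch

def toy_denoiser(latent, image_cond, t):
    """Stand-in for an image-conditional diffusion UNet that predicts the clean
    depth latent; a real model would be a trained network."""
    return 0.5 * latent + 0.1 * image_cond + 0.0 * t

def multi_step_depth(image_cond, steps=50):
    # Conventional multi-step loop: many network evaluations per image.
    latent = torch.randn_like(image_cond)
    for t in torch.linspace(1.0, 0.0, steps):
        x0 = toy_denoiser(latent, image_cond, t)
        latent = t * latent + (1 - t) * x0       # crude interpolation toward the prediction
    return latent

def single_step_depth(image_cond):
    # Single deterministic evaluation at the highest noise level, which is the
    # kind of one-step prediction that end-to-end fine-tuning then builds on.
    latent = torch.randn_like(image_cond)
    return toy_denoiser(latent, image_cond, t=torch.tensor(1.0))

cond = torch.randn(1, 4, 64, 64)
print(multi_step_depth(cond).shape, single_step_depth(cond).shape)
```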
{"title":"Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think","authors":"Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe","doi":"arxiv-2409.11355","DOIUrl":"https://doi.org/arxiv-2409.11355","url":null,"abstract":"Recent work showed that large diffusion models can be reused as highly\u0000precise monocular depth estimators by casting depth estimation as an\u0000image-conditional image generation task. While the proposed model achieved\u0000state-of-the-art results, high computational demands due to multi-step\u0000inference limited its use in many scenarios. In this paper, we show that the\u0000perceived inefficiency was caused by a flaw in the inference pipeline that has\u0000so far gone unnoticed. The fixed model performs comparably to the best\u0000previously reported configuration while being more than 200$times$ faster. To\u0000optimize for downstream task performance, we perform end-to-end fine-tuning on\u0000top of the single-step model with task-specific losses and get a deterministic\u0000model that outperforms all other diffusion-based depth and normal estimation\u0000models on common zero-shot benchmarks. We surprisingly find that this\u0000fine-tuning protocol also works directly on Stable Diffusion and achieves\u0000comparable performance to current state-of-the-art diffusion-based depth and\u0000normal estimation models, calling into question some of the conclusions drawn\u0000from prior works.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks, such as edge detection and human pose recognition, by transforming them into image generation tasks. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly than existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.
{"title":"OmniGen: Unified Image Generation","authors":"Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu","doi":"arxiv-2409.11340","DOIUrl":"https://doi.org/arxiv-2409.11340","url":null,"abstract":"In this work, we introduce OmniGen, a new diffusion model for unified image\u0000generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen\u0000no longer requires additional modules such as ControlNet or IP-Adapter to\u0000process diverse control conditions. OmniGenis characterized by the following\u0000features: 1) Unification: OmniGen not only demonstrates text-to-image\u0000generation capabilities but also inherently supports other downstream tasks,\u0000such as image editing, subject-driven generation, and visual-conditional\u0000generation. Additionally, OmniGen can handle classical computer vision tasks by\u0000transforming them into image generation tasks, such as edge detection and human\u0000pose recognition. 2) Simplicity: The architecture of OmniGen is highly\u0000simplified, eliminating the need for additional text encoders. Moreover, it is\u0000more user-friendly compared to existing diffusion models, enabling complex\u0000tasks to be accomplished through instructions without the need for extra\u0000preprocessing steps (e.g., human pose estimation), thereby significantly\u0000simplifying the workflow of image generation. 3) Knowledge Transfer: Through\u0000learning in a unified format, OmniGen effectively transfers knowledge across\u0000different tasks, manages unseen tasks and domains, and exhibits novel\u0000capabilities. We also explore the model's reasoning capabilities and potential\u0000applications of chain-of-thought mechanism. This work represents the first\u0000attempt at a general-purpose image generation model, and there remain several\u0000unresolved issues. We will open-source the related resources at\u0000https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"205 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}