
BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference - Latest Publications

Analysis of Training Object Detection Models with Synthetic Data
Bram Vanherle, Steven Moonen, F. Reeth, Nick Michiels
Recently, the use of synthetic training data has been on the rise as it offers correctly labelled datasets at a lower cost. The downside of this technique is that the so-called domain gap between the real target images and synthetic training data leads to a decrease in performance. In this paper, we attempt to provide a holistic overview of how to use synthetic data for object detection. We analyse aspects of generating the data as well as techniques used to train the models. We do so by devising a number of experiments, training models on the Dataset of Industrial Metal Objects (DIMO). This dataset contains both real and synthetic images. The synthetic part has different subsets that are either exact synthetic copies of the real data or are copies with certain aspects randomised. This allows us to analyse what types of variation are good for synthetic training data and which aspects should be modelled to closely match the target data. Furthermore, we investigate what types of training techniques are beneficial towards generalisation to real data, and how to use them. Additionally, we analyse how real images can be leveraged when training on synthetic images. All these experiments are validated on real data and benchmarked to models trained on real data. The results offer a number of interesting takeaways that can serve as basic guidelines for using synthetic data for object detection. Code to reproduce results is available at https://github.com/EDM-Research/DIMO_ObjectDetection.
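To make the training setup concrete, below is a minimal sketch (not the authors' pipeline; their code is at the GitHub link above) of one strategy the abstract discusses: first training a detector on synthetic images, then continuing training on the available real images. The dataset classes, paths, and class count are hypothetical placeholders.

```python
import torch
from torch.utils.data import DataLoader
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Hypothetical Dataset classes yielding (image_tensor, {"boxes": ..., "labels": ...})
# pairs in torchvision detection format for the synthetic and real DIMO subsets.
from dimo_datasets import SyntheticDIMO, RealDIMO  # hypothetical module

def collate(batch):
    # Detection models take lists of images/targets rather than stacked tensors.
    return tuple(zip(*batch))

model = fasterrcnn_resnet50_fpn(weights=None, num_classes=7)  # class count is an assumption
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()

# Phase 1: train on synthetic images only; Phase 2: continue on the small real subset.
for dataset in [SyntheticDIMO("data/dimo/synthetic"), RealDIMO("data/dimo/real")]:
    loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate)
    for images, targets in loader:
        loss_dict = model(list(images), list(targets))  # dict of detection loss terms
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```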
{"title":"Analysis of Training Object Detection Models with Synthetic Data","authors":"Bram Vanherle, Steven Moonen, F. Reeth, Nick Michiels","doi":"10.48550/arXiv.2211.16066","DOIUrl":"https://doi.org/10.48550/arXiv.2211.16066","url":null,"abstract":"Recently, the use of synthetic training data has been on the rise as it offers correctly labelled datasets at a lower cost. The downside of this technique is that the so-called domain gap between the real target images and synthetic training data leads to a decrease in performance. In this paper, we attempt to provide a holistic overview of how to use synthetic data for object detection. We analyse aspects of generating the data as well as techniques used to train the models. We do so by devising a number of experiments, training models on the Dataset of Industrial Metal Objects (DIMO). This dataset contains both real and synthetic images. The synthetic part has different subsets that are either exact synthetic copies of the real data or are copies with certain aspects randomised. This allows us to analyse what types of variation are good for synthetic training data and which aspects should be modelled to closely match the target data. Furthermore, we investigate what types of training techniques are beneficial towards generalisation to real data, and how to use them. Additionally, we analyse how real images can be leveraged when training on synthetic images. All these experiments are validated on real data and benchmarked to models trained on real data. The results offer a number of interesting takeaways that can serve as basic guidelines for using synthetic data for object detection. Code to reproduce results is available at https://github.com/EDM-Research/DIMO_ObjectDetection.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"252 1","pages":"833"},"PeriodicalIF":0.0,"publicationDate":"2022-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78199662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Efficient Feature Extraction for High-resolution Video Frame Interpolation
M. Nottebaum, S. Roth, Simone Schaub-Meyer
Most deep learning methods for video frame interpolation consist of three main components: feature extraction, motion estimation, and image synthesis. Existing approaches are mainly distinguishable in terms of how these modules are designed. However, when interpolating high-resolution images, e.g. at 4K, the design choices for achieving high accuracy within reasonable memory requirements are limited. The feature extraction layers help to compress the input and extract relevant information for the latter stages, such as motion estimation. However, these layers are often costly in parameters, computation time, and memory. We show how ideas from dimensionality reduction combined with a lightweight optimization can be used to compress the input representation while keeping the extracted information suitable for frame interpolation. Further, we require neither a pretrained flow network nor a synthesis network, additionally reducing the number of trainable parameters and required memory. When evaluating on three 4K benchmarks, we achieve state-of-the-art image quality among the methods without pretrained flow while having the lowest network complexity and memory requirements overall.
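As an illustration of the core idea, the following is a toy sketch of compressing the extracted features with a learned low-dimensional projection (a 1x1 convolution) before the memory-hungry later stages. It shows the dimensionality-reduction principle only; the layer sizes are assumptions, and this is not the paper's architecture.

```python
import torch
import torch.nn as nn

class CompressedFeatureExtractor(nn.Module):
    """Toy extractor: a few conv layers followed by a learned 1x1 projection
    that reduces channel dimensionality before the later (costly) stages."""
    def __init__(self, in_ch=3, feat_ch=64, compressed_ch=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Learned projection: analogous to PCA-style dimensionality reduction,
        # but trained jointly with the downstream interpolation objective.
        self.project = nn.Conv2d(feat_ch, compressed_ch, kernel_size=1)

    def forward(self, frame):
        return self.project(self.features(frame))

x = torch.randn(1, 3, 2160, 3840)  # a 4K frame
print(CompressedFeatureExtractor()(x).shape)  # torch.Size([1, 16, 1080, 1920])
```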
{"title":"Efficient Feature Extraction for High-resolution Video Frame Interpolation","authors":"M. Nottebaum, S. Roth, Simone Schaub-Meyer","doi":"10.48550/arXiv.2211.14005","DOIUrl":"https://doi.org/10.48550/arXiv.2211.14005","url":null,"abstract":"Most deep learning methods for video frame interpolation consist of three main components: feature extraction, motion estimation, and image synthesis. Existing approaches are mainly distinguishable in terms of how these modules are designed. However, when interpolating high-resolution images, e.g. at 4K, the design choices for achieving high accuracy within reasonable memory requirements are limited. The feature extraction layers help to compress the input and extract relevant information for the latter stages, such as motion estimation. However, these layers are often costly in parameters, computation time, and memory. We show how ideas from dimensionality reduction combined with a lightweight optimization can be used to compress the input representation while keeping the extracted information suitable for frame interpolation. Further, we require neither a pretrained flow network nor a synthesis network, additionally reducing the number of trainable parameters and required memory. When evaluating on three 4K benchmarks, we achieve state-of-the-art image quality among the methods without pretrained flow while having the lowest network complexity and memory requirements overall.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"24 1","pages":"825"},"PeriodicalIF":0.0,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91092339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
MorphPool: Efficient Non-linear Pooling & Unpooling in CNNs
R. Groenendijk, L. Dorst, T. Gevers
Pooling is essentially an operation from the field of Mathematical Morphology, with max pooling as a limited special case. The more general setting of MorphPooling greatly extends the tool set for building neural networks. In addition to pooling operations, encoder-decoder networks used for pixel-level predictions also require unpooling. It is common to combine unpooling with convolution or deconvolution for up-sampling. However, using its morphological properties, unpooling can be generalised and improved. Extensive experimentation on two tasks and three large-scale datasets shows that morphological pooling and unpooling lead to improved predictive performance at much reduced parameter counts.
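For readers unfamiliar with the morphological view, here is a minimal sketch of dilation-based pooling in which the structuring element is learnable; with an all-zero (flat) structuring element it reduces exactly to standard max pooling, matching the "limited special case" mentioned above. This is an illustration of the concept, not the paper's MorphPool operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphPool2d(nn.Module):
    """Grayscale-dilation pooling: out = max over window of (x + w).
    A fixed w == 0 recovers standard max pooling as the special case.
    For simplicity this sketch assumes kernel_size == stride and divisible input sizes."""
    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.k, self.s = kernel_size, stride
        # One learnable structuring element per channel.
        self.weight = nn.Parameter(torch.zeros(channels, kernel_size * kernel_size))

    def forward(self, x):
        b, c, h, w = x.shape
        patches = F.unfold(x, self.k, stride=self.s)            # (B, C*k*k, L)
        patches = patches.view(b, c, self.k * self.k, -1)        # (B, C, k*k, L)
        out = (patches + self.weight[None, :, :, None]).amax(dim=2)
        return out.view(b, c, h // self.s, w // self.s)

x = torch.randn(1, 8, 32, 32)
pool = MorphPool2d(channels=8)
assert torch.allclose(pool(x), F.max_pool2d(x, 2))  # zero structuring element == max pooling
```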
{"title":"MorphPool: Efficient Non-linear Pooling & Unpooling in CNNs","authors":"R. Groenendijk, L. Dorst, T. Gevers","doi":"10.48550/arXiv.2211.14037","DOIUrl":"https://doi.org/10.48550/arXiv.2211.14037","url":null,"abstract":"Pooling is essentially an operation from the field of Mathematical Morphology, with max pooling as a limited special case. The more general setting of MorphPooling greatly extends the tool set for building neural networks. In addition to pooling operations, encoder-decoder networks used for pixel-level predictions also require unpooling. It is common to combine unpooling with convolution or deconvolution for up-sampling. However, using its morphological properties, unpooling can be generalised and improved. Extensive experimentation on two tasks and three large-scale datasets shows that morphological pooling and unpooling lead to improved predictive performance at much reduced parameter counts.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"21 1","pages":"56"},"PeriodicalIF":0.0,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78470486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation
Liang Zeng, A. Lengyel, Nergis Tomen, J. V. Gemert
In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space is self-contained and invariant to the contexts in which they appear. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet we do not need to pre-train on ImageNet or COCO, and we are also more computationally efficient. Our code is available at https://github.com/LeungTsang/CPCDR
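A small NumPy sketch of the copy-paste step is given below: pixels are grouped by their estimated depth (a crude stand-in for a coherent depth region) and pasted into a second image so the same pixels appear in a new context. The depth thresholds and the hard paste are illustrative assumptions; the contrastive objective itself is omitted.

```python
import numpy as np

def copy_paste_depth_region(src_img, src_depth, dst_img, d_min=5.0, d_max=10.0):
    """Copy all source pixels whose estimated depth falls in [d_min, d_max]
    onto the destination image, giving the region a new context."""
    mask = (src_depth >= d_min) & (src_depth <= d_max)   # (H, W) boolean region mask
    out = dst_img.copy()
    out[mask] = src_img[mask]                            # paste the depth region
    return out, mask                                     # mask gives pixel correspondences

# Toy data: two RGB images and a fake depth map.
h, w = 128, 256
src = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
dst = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
depth = np.random.uniform(0.5, 20.0, (h, w)).astype(np.float32)

pasted, mask = copy_paste_depth_region(src, depth, dst)
# The same physical pixels now appear in `src` and `pasted`; a pixel-level
# contrastive loss can pull their embeddings together across the two contexts.
```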
{"title":"Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation","authors":"Liang Zeng, A. Lengyel, Nergis Tomen, J. V. Gemert","doi":"10.48550/arXiv.2211.14074","DOIUrl":"https://doi.org/10.48550/arXiv.2211.14074","url":null,"abstract":"In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space is self-contained and invariant to the contexts in which they appear. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% in mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet, we do not need to pre-train on ImageNet or COCO, and we are also more computationally efficient. Our code is available on https://github.com/LeungTsang/CPCDR","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"75 1","pages":"893"},"PeriodicalIF":0.0,"publicationDate":"2022-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83779256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
UV-Based 3D Hand-Object Reconstruction with Grasp Optimization
Ziwei Yu, Linlin Yang, You Xie, Ping Chen, Angela Yao
We propose a novel framework for 3D hand shape reconstruction and hand-object grasp optimization from a single RGB image. The representation of hand-object contact regions is critical for accurate reconstructions. Instead of approximating the contact regions with sparse points, as in previous works, we propose a dense representation in the form of a UV coordinate map. Furthermore, we introduce inference-time optimization to fine-tune the grasp and improve interactions between the hand and the object. Our pipeline increases hand shape reconstruction accuracy and produces a vibrant hand texture. Experiments on datasets such as Ho3D, FreiHAND, and DexYCB reveal that our proposed method outperforms the state-of-the-art.
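To illustrate the dense-representation idea, the sketch below stores contact as a map over the hand's UV texture space and queries it at per-vertex UV coordinates with bilinear sampling. The map resolution and the grid_sample-based lookup are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Dense contact representation: one scalar per texel of the hand's UV atlas,
# e.g. predicted by a decoder head (random values here for illustration).
uv_contact_map = torch.rand(1, 1, 256, 256)      # (B, 1, H_uv, W_uv), values in [0, 1]

# Per-vertex UV coordinates of a hand mesh, in [0, 1] (778 vertices, as in MANO).
vertex_uv = torch.rand(778, 2)

# grid_sample expects coordinates in [-1, 1] and a (B, H_out, W_out, 2) grid.
grid = (vertex_uv * 2.0 - 1.0).view(1, 1, -1, 2)
per_vertex_contact = F.grid_sample(uv_contact_map, grid, align_corners=True)
per_vertex_contact = per_vertex_contact.view(-1)  # (778,) contact value per vertex
```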
{"title":"UV-Based 3D Hand-Object Reconstruction with Grasp Optimization","authors":"Ziwei Yu, Linlin Yang, You Xie, Ping Chen, Angela Yao","doi":"10.48550/arXiv.2211.13429","DOIUrl":"https://doi.org/10.48550/arXiv.2211.13429","url":null,"abstract":"We propose a novel framework for 3D hand shape reconstruction and hand-object grasp optimization from a single RGB image. The representation of hand-object contact regions is critical for accurate reconstructions. Instead of approximating the contact regions with sparse points, as in previous works, we propose a dense representation in the form of a UV coordinate map. Furthermore, we introduce inference-time optimization to fine-tune the grasp and improve interactions between the hand and the object. Our pipeline increases hand shape reconstruction accuracy and produces a vibrant hand texture. Experiments on datasets such as Ho3D, FreiHAND, and DexYCB reveal that our proposed method outperforms the state-of-the-art.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"46 1","pages":"111"},"PeriodicalIF":0.0,"publicationDate":"2022-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74051699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
On the Importance of Image Encoding in Automated Chest X-Ray Report Generation
Otabek Nazarov, Mohammad Yaqub, K. Nandakumar
Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness. However, there is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition. Therefore, automated radiology report generation can be a very helpful tool in clinical practice. A typical report generation workflow consists of two main steps: (i) encoding the image into a latent space and (ii) generating the text of the report based on the latent image embedding. Many existing report generation techniques use a standard convolutional neural network (CNN) architecture for image encoding followed by a Transformer-based decoder for medical text generation. In most cases, CNN and the decoder are trained jointly in an end-to-end fashion. In this work, we primarily focus on understanding the relative importance of encoder and decoder components. Towards this end, we analyze four different image encoding approaches: direct, fine-grained, CLIP-based, and Cluster-CLIP-based encodings in conjunction with three different decoders on the large-scale MIMIC-CXR dataset. Among these encoders, the cluster CLIP visual encoder is a novel approach that aims to generate more discriminative and explainable representations. CLIP-based encoders produce comparable results to traditional CNN-based encoders in terms of NLP metrics, while fine-grained encoding outperforms all other encoders both in terms of NLP and clinical accuracy metrics, thereby validating the importance of image encoder to effectively extract semantic information. GitHub repository: https://github.com/mudabek/encoding-cxr-report-gen
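The two-step workflow described above (image encoding followed by text generation) can be sketched roughly as follows: a CNN encoder produces latent image tokens and a Transformer decoder autoregressively predicts report tokens conditioned on them. The backbone, vocabulary size, and layer sizes are placeholders, and this does not correspond to any specific encoder compared in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ReportGenerator(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        cnn = resnet50(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Linear(2048, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, report_tokens):
        feat = self.encoder(image)                              # (B, 2048, h, w)
        mem = self.proj(feat.flatten(2).transpose(1, 2))        # (B, h*w, d_model) image "tokens"
        tgt = self.embed(report_tokens)                         # (B, T, d_model)
        T = report_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, mem, tgt_mask=causal)
        return self.lm_head(out)                                # (B, T, vocab) next-token logits

logits = ReportGenerator()(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 32)))
```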
{"title":"On the Importance of Image Encoding in Automated Chest X-Ray Report Generation","authors":"Otabek Nazarov, Mohammad Yaqub, K. Nandakumar","doi":"10.48550/arXiv.2211.13465","DOIUrl":"https://doi.org/10.48550/arXiv.2211.13465","url":null,"abstract":"Chest X-ray is one of the most popular medical imaging modalities due to its accessibility and effectiveness. However, there is a chronic shortage of well-trained radiologists who can interpret these images and diagnose the patient's condition. Therefore, automated radiology report generation can be a very helpful tool in clinical practice. A typical report generation workflow consists of two main steps: (i) encoding the image into a latent space and (ii) generating the text of the report based on the latent image embedding. Many existing report generation techniques use a standard convolutional neural network (CNN) architecture for image encoding followed by a Transformer-based decoder for medical text generation. In most cases, CNN and the decoder are trained jointly in an end-to-end fashion. In this work, we primarily focus on understanding the relative importance of encoder and decoder components. Towards this end, we analyze four different image encoding approaches: direct, fine-grained, CLIP-based, and Cluster-CLIP-based encodings in conjunction with three different decoders on the large-scale MIMIC-CXR dataset. Among these encoders, the cluster CLIP visual encoder is a novel approach that aims to generate more discriminative and explainable representations. CLIP-based encoders produce comparable results to traditional CNN-based encoders in terms of NLP metrics, while fine-grained encoding outperforms all other encoders both in terms of NLP and clinical accuracy metrics, thereby validating the importance of image encoder to effectively extract semantic information. GitHub repository: https://github.com/mudabek/encoding-cxr-report-gen","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"45 1","pages":"475"},"PeriodicalIF":0.0,"publicationDate":"2022-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88629050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Multi-View Neural Surface Reconstruction with Structured Light
Chunyu Li, Taisuke Hashimoto, Eiichi Matsumoto, Hiroharu Kato
Three-dimensional (3D) object reconstruction based on differentiable rendering (DR) is an active research topic in computer vision. DR-based methods minimize the difference between the rendered and target images by optimizing both the shape and appearance and realizing a high visual reproducibility. However, most approaches perform poorly for textureless objects because of the geometrical ambiguity, which means that multiple shapes can have the same rendered result in such objects. To overcome this problem, we introduce active sensing with structured light (SL) into multi-view 3D object reconstruction based on DR to learn the unknown geometry and appearance of arbitrary scenes and camera poses. More specifically, our framework leverages the correspondences between pixels in different views calculated by structured light as an additional constraint in the DR-based optimization of implicit surface, color representations, and camera poses. Because camera poses can be optimized simultaneously, our method realizes high reconstruction accuracy in the textureless region and reduces efforts for camera pose calibration, which is required for conventional SL-based methods. Experiment results on both synthetic and real data demonstrate that our system outperforms conventional DR- and SL-based methods in a high-quality surface reconstruction, particularly for challenging objects with textureless or shiny surfaces.
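As a rough illustration of how structured-light correspondences could act as an additional constraint, the sketch below unprojects matched pixels from two views using the current depth and pose estimates and penalises the distance between the resulting 3D points. The unprojection helper and the way depths would be obtained from the implicit surface are assumptions, not the paper's formulation.

```python
import torch

def unproject(pixels, depth, K, cam_to_world):
    """Lift pixel coordinates (N, 2) with per-pixel depth (N,) to world-space points (N, 3)."""
    ones = torch.ones(pixels.shape[0], 1)
    rays = torch.linalg.solve(K, torch.cat([pixels, ones], dim=1).T).T  # camera-space directions
    pts_cam = rays * depth[:, None]
    pts_h = torch.cat([pts_cam, ones], dim=1) @ cam_to_world.T          # homogeneous transform
    return pts_h[:, :3]

def correspondence_loss(px_a, px_b, depth_a, depth_b, K, pose_a, pose_b):
    """Matched pixels (from structured-light decoding) should map to the same surface point."""
    pts_a = unproject(px_a, depth_a, K, pose_a)
    pts_b = unproject(px_b, depth_b, K, pose_b)
    return (pts_a - pts_b).norm(dim=1).mean()

# Toy example: 100 correspondences between views A and B.
K = torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pose = torch.eye(4)
loss = correspondence_loss(torch.rand(100, 2) * 640, torch.rand(100, 2) * 640,
                           torch.rand(100) * 2 + 1, torch.rand(100) * 2 + 1,
                           K, pose, pose)
```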
{"title":"Multi-View Neural Surface Reconstruction with Structured Light","authors":"Chunyu Li, Taisuke Hashimoto, Eiichi Matsumoto, Hiroharu Kato","doi":"10.48550/arXiv.2211.11971","DOIUrl":"https://doi.org/10.48550/arXiv.2211.11971","url":null,"abstract":"Three-dimensional (3D) object reconstruction based on differentiable rendering (DR) is an active research topic in computer vision. DR-based methods minimize the difference between the rendered and target images by optimizing both the shape and appearance and realizing a high visual reproductivity. However, most approaches perform poorly for textureless objects because of the geometrical ambiguity, which means that multiple shapes can have the same rendered result in such objects. To overcome this problem, we introduce active sensing with structured light (SL) into multi-view 3D object reconstruction based on DR to learn the unknown geometry and appearance of arbitrary scenes and camera poses. More specifically, our framework leverages the correspondences between pixels in different views calculated by structured light as an additional constraint in the DR-based optimization of implicit surface, color representations, and camera poses. Because camera poses can be optimized simultaneously, our method realizes high reconstruction accuracy in the textureless region and reduces efforts for camera pose calibration, which is required for conventional SL-based methods. Experiment results on both synthetic and real data demonstrate that our system outperforms conventional DR- and SL-based methods in a high-quality surface reconstruction, particularly for challenging objects with textureless or shiny surfaces.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"31 1","pages":"550"},"PeriodicalIF":0.0,"publicationDate":"2022-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91483064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
S2-Flow: Joint Semantic and Style Editing of Facial Images
Krishnakant Singh, Simone Schaub-Meyer, S. Roth
The high-quality images yielded by generative adversarial networks (GANs) have motivated investigations into their application for image editing. However, GANs are often limited in the control they provide for performing specific edits. One of the principal challenges is the entangled latent space of GANs, which is not directly suitable for performing independent and detailed edits. Recent editing methods allow for either controlled style edits or controlled semantic edits. In addition, methods that use semantic masks to edit images have difficulty preserving the identity and are unable to perform controlled style edits. We propose a method to disentangle a GAN's latent space into semantic and style spaces, enabling controlled semantic and style edits for face images independently within the same framework. To achieve this, we design an encoder-decoder based network architecture (S²-Flow), which incorporates two proposed inductive biases. We show the suitability of S²-Flow quantitatively and qualitatively by performing various semantic and style edits.
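A schematic sketch of the disentanglement idea follows: the encoder emits two separate latent codes, one semantic and one style, and the decoder consumes both, so editing one code while freezing the other yields independent edits. The plain autoencoder structure and layer sizes are illustrative assumptions, not the S2-Flow architecture.

```python
import torch
import torch.nn as nn

class SplitLatentAutoencoder(nn.Module):
    def __init__(self, dim_sem=64, dim_style=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_sem = nn.Linear(64, dim_sem)      # semantic code (layout / identity-level factors)
        self.to_style = nn.Linear(64, dim_style)  # style code (appearance-level factors)
        self.decoder = nn.Sequential(
            nn.Linear(dim_sem + dim_style, 8 * 8 * 64), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def encode(self, x):
        h = self.backbone(x)
        return self.to_sem(h), self.to_style(h)

    def decode(self, z_sem, z_style):
        return self.decoder(torch.cat([z_sem, z_style], dim=1))

model = SplitLatentAutoencoder()
z_sem, z_style = model.encode(torch.randn(1, 3, 32, 32))
edited = model.decode(z_sem, z_style + 0.5)   # perturb the style code only; semantics held fixed
```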
{"title":"S2-Flow: Joint Semantic and Style Editing of Facial Images","authors":"Krishnakant Singh, Simone Schaub-Meyer, S. Roth","doi":"10.48550/arXiv.2211.12209","DOIUrl":"https://doi.org/10.48550/arXiv.2211.12209","url":null,"abstract":"The high-quality images yielded by generative adversarial networks (GANs) have motivated investigations into their application for image editing. However, GANs are often limited in the control they provide for performing specific edits. One of the principal challenges is the entangled latent space of GANs, which is not directly suitable for performing independent and detailed edits. Recent editing methods allow for either controlled style edits or controlled semantic edits. In addition, methods that use semantic masks to edit images have difficulty preserving the identity and are unable to perform controlled style edits. We propose a method to disentangle a GAN$text{'}$s latent space into semantic and style spaces, enabling controlled semantic and style edits for face images independently within the same framework. To achieve this, we design an encoder-decoder based network architecture ($S^2$-Flow), which incorporates two proposed inductive biases. We show the suitability of $S^2$-Flow quantitatively and qualitatively by performing various semantic and style edits.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"1 1","pages":"821"},"PeriodicalIF":0.0,"publicationDate":"2022-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88216275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Doubly Contrastive End-to-End Semantic Segmentation for Autonomous Driving under Adverse Weather
Jong-Lyul Jeong, Jong-Hwan Kim
Road scene understanding tasks have recently become crucial for self-driving vehicles. In particular, real-time semantic segmentation is indispensable for intelligent self-driving agents to recognize roadside objects in the driving area. As prior research works have primarily sought to improve the segmentation performance with computationally heavy operations, they require significant hardware resources for both training and deployment, and thus are not suitable for real-time applications. As such, we propose a doubly contrastive approach to improve the performance of a more practical lightweight model for self-driving, specifically under adverse weather conditions such as fog, nighttime, rain, and snow. Our proposed approach exploits both image- and pixel-level contrasts in an end-to-end supervised learning scheme without requiring a memory bank for global consistency or the pretraining step used in conventional contrastive methods. We validate the effectiveness of our method using SwiftNet on the ACDC dataset, where it achieves up to a 1.34%p improvement in mIoU (ResNet-18 backbone) at 66.7 FPS (2048x1024 resolution) on a single RTX 3080 Mobile GPU at inference. Furthermore, we demonstrate that replacing image-level supervision with self-supervision achieves comparable performance when pre-trained with clear-weather images.
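The "doubly contrastive" objective can be sketched as an image-level plus a pixel-level InfoNCE term added to the usual supervised segmentation loss. The projection dimensions, temperature, and choice of positives below are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: (N, D) matched row-wise; the other rows serve as negatives."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.T / temperature                      # (N, N) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)   # i-th anchor matches i-th positive
    return F.cross_entropy(logits, labels)

# Image-level: global embeddings of two augmented views of the same scenes.
img_a, img_b = torch.randn(8, 128), torch.randn(8, 128)
# Pixel-level: embeddings of corresponding pixels across the two views.
pix_a, pix_b = torch.randn(256, 64), torch.randn(256, 64)

loss = info_nce(img_a, img_b) + info_nce(pix_a, pix_b)  # the "doubly" contrastive term
# This would be added to the supervised segmentation loss during end-to-end training.
```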
{"title":"Doubly Contrastive End-to-End Semantic Segmentation for Autonomous Driving under Adverse Weather","authors":"Jong-Lyul Jeong, Jong-Hwan Kim","doi":"10.48550/arXiv.2211.11131","DOIUrl":"https://doi.org/10.48550/arXiv.2211.11131","url":null,"abstract":"Road scene understanding tasks have recently become crucial for self-driving vehicles. In particular, real-time semantic segmentation is indispensable for intelligent self-driving agents to recognize roadside objects in the driving area. As prior research works have primarily sought to improve the segmentation performance with computationally heavy operations, they require far significant hardware resources for both training and deployment, and thus are not suitable for real-time applications. As such, we propose a doubly contrastive approach to improve the performance of a more practical lightweight model for self-driving, specifically under adverse weather conditions such as fog, nighttime, rain and snow. Our proposed approach exploits both image- and pixel-level contrasts in an end-to-end supervised learning scheme without requiring a memory bank for global consistency or the pretraining step used in conventional contrastive methods. We validate the effectiveness of our method using SwiftNet on the ACDC dataset, where it achieves up to 1.34%p improvement in mIoU (ResNet-18 backbone) at 66.7 FPS (2048x1024 resolution) on a single RTX 3080 Mobile GPU at inference. Furthermore, we demonstrate that replacing image-level supervision with self-supervision achieves comparable performance when pre-trained with clear weather images.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"2 1","pages":"460"},"PeriodicalIF":0.0,"publicationDate":"2022-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79144144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism
Satoshi Ikehata
Existing deep calibrated photometric stereo networks basically aggregate observations under different lights based on pre-defined operations such as linear projection and max pooling. While they are effective with dense capture, simple first-order operations often fail to capture the high-order interactions among observations under a small number of different lights. To tackle this issue, this paper presents a deep sparse calibrated photometric stereo network named PS-Transformer, which leverages the learnable self-attention mechanism to properly capture the complex inter-image interactions. PS-Transformer builds upon a dual-branch design to explore both pixel-wise and image-wise features, and each feature is trained with intermediate surface-normal supervision to maximize geometric feasibility. A new synthetic dataset named CyclesPS+ is also presented, with a comprehensive analysis, to successfully train the photometric stereo networks. Extensive results on the publicly available benchmark datasets demonstrate that the surface normal prediction accuracy of the proposed method significantly outperforms other state-of-the-art algorithms with the same number of input images and is even comparable to that of dense algorithms which input a 10× larger number of images.
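A minimal sketch of the aggregation idea follows: the sparse set of per-pixel observations (one feature vector per light) is fused with self-attention instead of a fixed max-pooling step, and a small head predicts the surface normal. The feature dimension, mean pooling of attended tokens, and the prediction head are placeholder choices, not the PS-Transformer architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAggregator(nn.Module):
    """Fuses K observations (e.g. K lights) per pixel into one surface-normal estimate."""
    def __init__(self, obs_dim=6, d_model=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)           # per-observation embedding
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.head = nn.Linear(d_model, 3)                   # predict a unit normal

    def forward(self, obs):
        # obs: (P, K, obs_dim) = P pixels, K lights; each observation could hold
        # e.g. the measured intensity concatenated with the light direction.
        tokens = self.embed(obs)
        attended, _ = self.attn(tokens, tokens, tokens)     # high-order interactions across lights
        pooled = attended.mean(dim=1)                       # order-invariant aggregation
        return F.normalize(self.head(pooled), dim=1)

normals = AttentionAggregator()(torch.randn(1024, 10, 6))   # 1024 pixels, 10 lights
print(normals.shape)  # torch.Size([1024, 3])
```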
{"title":"PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism","authors":"Satoshi Ikehata","doi":"10.48550/arXiv.2211.11386","DOIUrl":"https://doi.org/10.48550/arXiv.2211.11386","url":null,"abstract":"Existing deep calibrated photometric stereo networks basically aggregate observations under different lights based on the pre-defined operations such as linear projection and max pooling. While they are effective with the dense capture, simple first-order operations often fail to capture the high-order interactions among observations under small number of different lights. To tackle this issue, this paper presents a deep sparse calibrated photometric stereo network named {it PS-Transformer} which leverages the learnable self-attention mechanism to properly capture the complex inter-image interactions. PS-Transformer builds upon the dual-branch design to explore both pixel-wise and image-wise features and individual feature is trained with the intermediate surface normal supervision to maximize geometric feasibility. A new synthetic dataset named CyclesPS+ is also presented with the comprehensive analysis to successfully train the photometric stereo networks. Extensive results on the publicly available benchmark datasets demonstrate that the surface normal prediction accuracy of the proposed method significantly outperforms other state-of-the-art algorithms with the same number of input images and is even comparable to that of dense algorithms which input 10$times$ larger number of images.","PeriodicalId":72437,"journal":{"name":"BMVC : proceedings of the British Machine Vision Conference. British Machine Vision Conference","volume":"11 1","pages":"30"},"PeriodicalIF":0.0,"publicationDate":"2022-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87630751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9