Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision最新文献_第4页

Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection 三维目标检测的同质多模态特征融合与交互

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-18 DOI: 10.48550/arXiv.2210.09615

Xin Li, Botian Shi, Yuenan Hou, Xingjiao Wu, Tianlong Ma, Yikang Li, Liangbo He

Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels. These fusion approaches often suffer from severe information loss, thus causing sub-optimal performance. To address these problems, we construct the homogeneous structure between the point cloud and images to avoid projective information loss by transforming the camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing the self-attention based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations, which can provide object-level alignment guidance for cross-modal feature fusion and strengthen the discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Dataset, and the proposed HMFI achieves better performance compared with the state-of-the-art multi-modal methods. Particularly, for the 3D detection of cyclist on the KITTI benchmark, HMFI surpasses all the published algorithms by a large margin.

多模态三维目标检测一直是自动驾驶领域的研究热点。然而，探索稀疏的3D点与密集的2D像素之间的跨模态特征融合并非易事。最近的方法要么将图像特征与投影到二维图像平面上的点云特征融合，要么将稀疏的点云与密集的图像像素结合起来。这些融合方法经常遭受严重的信息丢失，从而导致次优性能。为了解决这些问题，我们在点云和图像之间构建同质结构，通过将相机特征转换到LiDAR三维空间来避免投影信息的丢失。本文提出了一种均匀多模态特征融合与交互方法(HMFI)用于三维目标检测。具体来说，我们首先设计了一个图像体素提升模块(IVLM)，将二维图像特征提升到三维空间，并生成均匀的图像体素特征。然后，通过引入基于自关注的查询融合机制(QFM)，将体素化的点云特征与来自不同区域的图像特征进行融合;接下来，我们提出了一个体素特征交互模块(VFIM)来增强同质点云和图像体素表示中相同对象语义信息的一致性，可以为跨模态特征融合提供对象级对齐指导，增强复杂背景下的判别能力。我们在KITTI和Waymo开放数据集上进行了大量的实验，与最先进的多模态方法相比，所提出的HMFI取得了更好的性能。特别是在KITTI基准上对自行车手的三维检测，HMFI大大超过了所有已发表的算法。

{"title":"Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection","authors":"Xin Li, Botian Shi, Yuenan Hou, Xingjiao Wu, Tianlong Ma, Yikang Li, Liangbo He","doi":"10.48550/arXiv.2210.09615","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09615","url":null,"abstract":"Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels. These fusion approaches often suffer from severe information loss, thus causing sub-optimal performance. To address these problems, we construct the homogeneous structure between the point cloud and images to avoid projective information loss by transforming the camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing the self-attention based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) to enforce the consistency of semantic information from identical objects in the homogeneous point cloud and image voxel representations, which can provide object-level alignment guidance for cross-modal feature fusion and strengthen the discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Dataset, and the proposed HMFI achieves better performance compared with the state-of-the-art multi-modal methods. Particularly, for the 3D detection of cyclist on the KITTI benchmark, HMFI surpasses all the published algorithms by a large margin.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"382 1","pages":"691-707"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84958408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

Towards Efficient and Effective Self-Supervised Learning of Visual Representations 视觉表征的高效和有效的自监督学习

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-18 DOI: 10.48550/arXiv.2210.09866

Sravanti Addepalli, K. Bhogale, P. Dey, R. Venkatesh Babu

Self-supervision has emerged as a propitious method for visual representation learning after the recent paradigm shift from handcrafted pretext tasks to instance-similarity based approaches. Most state-of-the-art methods enforce similarity between various augmentations of a given image, while some methods additionally use contrastive approaches to explicitly ensure diverse representations. While these approaches have indeed shown promising direction, they require a significantly larger number of training iterations when compared to the supervised counterparts. In this work, we explore reasons for the slow convergence of these methods, and further propose to strengthen them using well-posed auxiliary tasks that converge significantly faster, and are also useful for representation learning. The proposed method utilizes the task of rotation prediction to improve the efficiency of existing state-of-the-art methods. We demonstrate significant gains in performance using the proposed method on multiple datasets, specifically for lower training epochs.

在最近从手工制作的借口任务到基于实例相似性的方法的范式转变之后，自我监督已经成为视觉表征学习的一种有利方法。大多数最先进的方法在给定图像的各种增强之间强制相似性，而一些方法另外使用对比方法来明确地确保不同的表示。虽然这些方法确实显示出了有希望的方向，但是与有监督的方法相比，它们需要大量的训练迭代。在这项工作中，我们探讨了这些方法收敛缓慢的原因，并进一步提出使用良好定位的辅助任务来加强它们，这些任务的收敛速度要快得多，并且对表示学习也很有用。该方法利用旋转预测任务来提高现有最先进方法的效率。我们证明了在多个数据集上使用所提出的方法在性能上的显着提高，特别是对于较低的训练周期。

{"title":"Towards Efficient and Effective Self-Supervised Learning of Visual Representations","authors":"Sravanti Addepalli, K. Bhogale, P. Dey, R. Venkatesh Babu","doi":"10.48550/arXiv.2210.09866","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09866","url":null,"abstract":"Self-supervision has emerged as a propitious method for visual representation learning after the recent paradigm shift from handcrafted pretext tasks to instance-similarity based approaches. Most state-of-the-art methods enforce similarity between various augmentations of a given image, while some methods additionally use contrastive approaches to explicitly ensure diverse representations. While these approaches have indeed shown promising direction, they require a significantly larger number of training iterations when compared to the supervised counterparts. In this work, we explore reasons for the slow convergence of these methods, and further propose to strengthen them using well-posed auxiliary tasks that converge significantly faster, and are also useful for representation learning. The proposed method utilizes the task of rotation prediction to improve the efficiency of existing state-of-the-art methods. We demonstrate significant gains in performance using the proposed method on multiple datasets, specifically for lower training epochs.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"43 1","pages":"523-538"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85405996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection 相机-雷达融合与光线约束交叉注意鲁棒三维目标检测

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-17 DOI: 10.48550/arXiv.2210.09267

Jyh-Jing Hwang, Henrik Kretzschmar, Joshua M. Manela, Sean M. Rafferty, N. Armstrong-Crews, Tiffany Chen, Drago Anguelov

Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.

鲁棒的3D目标检测对于安全的自动驾驶至关重要。相机和雷达传感器是协同的，因为它们捕获互补的信息，在不同的环境条件下都能很好地工作。然而，融合相机和雷达数据具有挑战性，因为每个传感器都缺乏垂直轴上的信息，也就是说，相机不知道深度，雷达不知道仰角。我们提出了相机-雷达匹配网络CramNet，这是一种在联合三维空间中融合相机和雷达传感器读数的有效方法。为了利用雷达距离测量来更好地预测相机深度，我们提出了一种新的射线约束交叉注意机制，该机制解决了相机特征和雷达特征之间几何对应关系的模糊性。我们的方法支持传感器模态dropout训练，即使在车辆上的相机或雷达传感器突然发生故障时，也能实现鲁棒的3D物体检测。我们通过在辐射数据集(为数不多的提供雷达射频图像的大型数据集之一)上的大量实验证明了我们的融合方法的有效性。在Waymo开放数据集上，我们的方法的一个仅用于相机的变体在单目3D物体检测方面取得了具有竞争力的性能。

{"title":"CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection","authors":"Jyh-Jing Hwang, Henrik Kretzschmar, Joshua M. Manela, Sean M. Rafferty, N. Armstrong-Crews, Tiffany Chen, Drago Anguelov","doi":"10.48550/arXiv.2210.09267","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09267","url":null,"abstract":"Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"11 1","pages":"388-405"},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84939731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Selective Query-Guided Debiasing for Video Corpus Moment Retrieval 视频语料库时刻检索的选择性查询导向去偏

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-17 DOI: 10.1007/978-3-031-20059-5_11

Sunjae Yoon, Jiajing Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Changdong Yoo

引用次数: 4

Distilling Object Detectors With Global Knowledge 利用全局知识提取对象检测器

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-17 DOI: 10.48550/arXiv.2210.09022

Sanli Tang, Zhongyu Zhang, Zhanzhan Cheng, Jing Lu, Yunlu Xu, Yi Niu, Fan He

. Knowledge distillation learns a lightweight student model that mimics a cumbersome teacher. Existing methods regard the knowledge as the feature of each instance or their relations, which is the instance-level knowledge only from the teacher model, i.e., the local knowledge. However, the empirical studies show that the local knowledge is much noisy in object detection tasks, especially on the blurred, occluded, or small instances. Thus, a more intrinsic approach is to measure the representations of instances w.r.t. a group of common basis vectors in the two feature spaces of the teacher and the student detectors, i.e., global knowledge. Then, the distilling algorithm can be applied as space alignment. To this end, a novel prototype generation module (PGM) is proposed to find the common basis vectors, dubbed prototypes , in the two feature spaces. Then, a robust distilling module (RDM) is applied to construct the global knowledge based on the prototypes and filtrate noisy local knowledge by measuring the discrepancy of the representations in two feature spaces. Experiments with Faster-RCNN and RetinaNet on PASCAL and COCO datasets show that our method achieves the best performance for distilling object detectors with various backbones, which even surpasses the performance of the teacher model. We also show that the existing methods can be easily combined with global knowledge and obtain further improvement. Code is available: https://github.com/hikvision-research/DAVAR-Lab-ML . to (1) construct the global knowledge by projecting the instances w.r.t. the prototypes, and (2) robustly distill the global and local knowledge by measuring their discrepancy in the two spaces. Experiments show that the proposed method achieves state-of-the-art performance on two popular detection frameworks and benchmarks. The extensive experimental results show that the proposed method can be easily stretched with larger teachers and the existing knowledge distillation methods to obtain further improvement.

。知识蒸馏学习一个轻量级的学生模型，模仿一个笨重的老师。现有的方法将知识视为每个实例的特征或它们之间的关系，这是仅来自教师模型的实例级知识，即局部知识。然而，实证研究表明，在目标检测任务中，局部知识存在较大的噪声，特别是在模糊、遮挡或小样本情况下。因此，一种更内在的方法是测量实例的表示，而不是教师和学生检测器的两个特征空间中的一组公共基向量，即全局知识。然后，将提取算法应用于空间对齐。为此，提出了一种新的原型生成模块(PGM)来寻找两个特征空间中的公共基向量，称为原型。然后，采用鲁棒提取模块(RDM)构建基于原型的全局知识，并通过测量两个特征空间中表示的差异来过滤有噪声的局部知识。在PASCAL和COCO数据集上对Faster-RCNN和RetinaNet进行的实验表明，我们的方法在提取具有多种主干的目标检测器方面取得了最好的性能，甚至超过了教师模型的性能。我们还表明，现有的方法可以很容易地与全局知识相结合，并得到进一步的改进。可获得代码:https://github.com/hikvision-research/DAVAR-Lab-ML。(1)通过在原型的基础上投影实例来构建全局知识;(2)通过测量它们在两个空间中的差异来稳健地提取全局和局部知识。实验表明，该方法在两种流行的检测框架和基准测试中都达到了最先进的性能。广泛的实验结果表明，该方法可以很容易地扩展到更大的教师和现有的知识蒸馏方法，以获得进一步的改进。

{"title":"Distilling Object Detectors With Global Knowledge","authors":"Sanli Tang, Zhongyu Zhang, Zhanzhan Cheng, Jing Lu, Yunlu Xu, Yi Niu, Fan He","doi":"10.48550/arXiv.2210.09022","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09022","url":null,"abstract":". Knowledge distillation learns a lightweight student model that mimics a cumbersome teacher. Existing methods regard the knowledge as the feature of each instance or their relations, which is the instance-level knowledge only from the teacher model, i.e., the local knowledge. However, the empirical studies show that the local knowledge is much noisy in object detection tasks, especially on the blurred, occluded, or small instances. Thus, a more intrinsic approach is to measure the representations of instances w.r.t. a group of common basis vectors in the two feature spaces of the teacher and the student detectors, i.e., global knowledge. Then, the distilling algorithm can be applied as space alignment. To this end, a novel prototype generation module (PGM) is proposed to find the common basis vectors, dubbed prototypes , in the two feature spaces. Then, a robust distilling module (RDM) is applied to construct the global knowledge based on the prototypes and filtrate noisy local knowledge by measuring the discrepancy of the representations in two feature spaces. Experiments with Faster-RCNN and RetinaNet on PASCAL and COCO datasets show that our method achieves the best performance for distilling object detectors with various backbones, which even surpasses the performance of the teacher model. We also show that the existing methods can be easily combined with global knowledge and obtain further improvement. Code is available: https://github.com/hikvision-research/DAVAR-Lab-ML . to (1) construct the global knowledge by projecting the instances w.r.t. the prototypes, and (2) robustly distill the global and local knowledge by measuring their discrepancy in the two spaces. Experiments show that the proposed method achieves state-of-the-art performance on two popular detection frameworks and benchmarks. The extensive experimental results show that the proposed method can be easily stretched with larger teachers and the existing knowledge distillation methods to obtain further improvement.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"18 1","pages":"422-438"},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81107469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Geometric Representation Learning for Document Image Rectification 基于几何表示学习的文档图像校正

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-15 DOI: 10.48550/arXiv.2210.08161

Hao Feng, Wen-gang Zhou, Jiajun Deng, Yuechen Wang, Houqiang Li

In document image rectification, there exist rich geometric constraints between the distorted image and the ground truth one. However, such geometric constraints are largely ignored in existing advanced solutions, which limits the rectification performance. To this end, we present DocGeoNet for document image rectification by introducing explicit geometric representation. Technically, two typical attributes of the document image are involved in the proposed geometric representation learning, i.e., 3D shape and textlines. Our motivation arises from the insight that 3D shape provides global unwarping cues for rectifying a distorted document image while overlooking the local structure. On the other hand, textlines complementarily provide explicit geometric constraints for local patterns. The learned geometric representation effectively bridges the distorted image and the ground truth one. Extensive experiments show the effectiveness of our framework and demonstrate the superiority of our DocGeoNet over state-of-the-art methods on both the DocUNet Benchmark dataset and our proposed DIR300 test set. The code is available at https://github.com/fh2019ustc/DocGeoNet.

在文档图像校正中，失真图像与地面真实图像之间存在着丰富的几何约束。然而，在现有的先进解决方案中，这种几何约束在很大程度上被忽略，从而限制了整流性能。为此，我们通过引入显式几何表示，提出了用于文档图像校正的DocGeoNet。从技术上讲，所提出的几何表示学习涉及文档图像的两个典型属性，即3D形状和文本线。我们的动机源于这样一种见解，即3D形状为纠正扭曲的文档图像提供了全局解扭曲线索，同时忽略了局部结构。另一方面，文本线补充地为局部模式提供明确的几何约束。学习到的几何表示有效地连接了扭曲图像和真实图像。大量的实验表明了我们的框架的有效性，并证明了我们的DocGeoNet在DocUNet基准数据集和我们提出的DIR300测试集上优于最先进的方法。代码可在https://github.com/fh2019ustc/DocGeoNet上获得。

{"title":"Geometric Representation Learning for Document Image Rectification","authors":"Hao Feng, Wen-gang Zhou, Jiajun Deng, Yuechen Wang, Houqiang Li","doi":"10.48550/arXiv.2210.08161","DOIUrl":"https://doi.org/10.48550/arXiv.2210.08161","url":null,"abstract":"In document image rectification, there exist rich geometric constraints between the distorted image and the ground truth one. However, such geometric constraints are largely ignored in existing advanced solutions, which limits the rectification performance. To this end, we present DocGeoNet for document image rectification by introducing explicit geometric representation. Technically, two typical attributes of the document image are involved in the proposed geometric representation learning, i.e., 3D shape and textlines. Our motivation arises from the insight that 3D shape provides global unwarping cues for rectifying a distorted document image while overlooking the local structure. On the other hand, textlines complementarily provide explicit geometric constraints for local patterns. The learned geometric representation effectively bridges the distorted image and the ground truth one. Extensive experiments show the effectiveness of our framework and demonstrate the superiority of our DocGeoNet over state-of-the-art methods on both the DocUNet Benchmark dataset and our proposed DIR300 test set. The code is available at https://github.com/fh2019ustc/DocGeoNet.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"607 1","pages":"475-492"},"PeriodicalIF":0.0,"publicationDate":"2022-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78955782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds LESS:激光雷达点云的高效标签语义分割

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-14 DOI: 10.48550/arXiv.2210.08064

Minghua Liu, Yin Zhou, C. Qi, Boqing Gong, Hao Su, Drago Anguelov

Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.

激光雷达点云的语义分割是自动驾驶中的一项重要任务。然而，通过传统的监督方法训练深度模型需要大量的数据集，而这些数据集的标记成本很高。使用标签高效的分割方法将模型扩展到新的操作领域或在极少数情况下提高性能是至关重要的。虽然大多数先前的工作都集中在室内场景，但我们是第一个为带有LiDAR点云的室外场景提出标签高效语义分割管道的人之一。我们的方法与半/弱监督学习共同设计了一个高效的标记过程，适用于几乎任何3D语义分割主干。具体而言，我们利用户外场景中的几何图案进行启发式预分割，减少人工标注，并与标注过程共同设计学习目标。在学习步骤中，我们利用原型学习来获得更具描述性的点嵌入，并使用多扫描蒸馏从时间聚合的点云中挖掘更丰富的语义，以提高单扫描模型的性能。在SemanticKITTI和nuScenes数据集上进行了评估，结果表明我们提出的方法优于现有的标签高效方法。在人工标注极其有限的情况下(例如，0.1%的点标签)，我们提出的方法甚至比具有100%标签的完全监督方法更具竞争力。

{"title":"LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds","authors":"Minghua Liu, Yin Zhou, C. Qi, Boqing Gong, Hao Su, Drago Anguelov","doi":"10.48550/arXiv.2210.08064","DOIUrl":"https://doi.org/10.48550/arXiv.2210.08064","url":null,"abstract":"Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"75 1","pages":"70-89"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73860713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis 具有门控注意和兴趣区域生成的令人惊讶的简单场景文本去除方法:一个全面的突出模型分析

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-14 DOI: 10.48550/arXiv.2210.07489

Hyeonsu Lee, Chankyu Choi

Scene text removal (STR), a task of erasing text from natural scene images, has recently attracted attention as an important component of editing text or concealing private information such as ID, telephone, and license plate numbers. While there are a variety of different methods for STR actively being researched, it is difficult to evaluate superiority because previously proposed methods do not use the same standardized training/evaluation dataset. We use the same standardized training/testing dataset to evaluate the performance of several previous methods after standardized re-implementation. We also introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper. GA uses attention to focus on the text stroke as well as the textures and colors of the surrounding regions to remove text from the input image much more precisely. RoIG is applied to focus on only the region with text instead of the entire image to train the model more efficiently. Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics with remarkably higher-quality results. Furthermore, because our model does not generate a text stroke mask explicitly, there is no need for additional refinement steps or sub-models, making our model extremely fast with fewer parameters. The dataset and code are available at this https://github.com/naver/garnet.

场景文本删除(Scene text removal, STR)是一种从自然场景图像中删除文本的技术，近年来作为编辑文本或隐藏身份证件、电话号码、车牌号码等私人信息的重要组成部分而受到关注。虽然有各种不同的STR方法正在被积极研究，但很难评估其优越性，因为之前提出的方法没有使用相同的标准化训练/评估数据集。我们使用相同的标准化训练/测试数据集来评估标准化后重新实现的几种先前方法的性能。本文还介绍了一种简单但非常有效的门控注意(GA)和兴趣区域生成(RoIG)方法。GA将注意力集中在文本笔画以及周围区域的纹理和颜色上，从而更精确地从输入图像中删除文本。RoIG应用于只关注有文本的区域，而不是整个图像，以更有效地训练模型。在基准数据集上的实验结果表明，我们的方法在几乎所有指标上都明显优于现有的最先进的方法，并且结果质量显著提高。此外，由于我们的模型不显式地生成文本笔画遮罩，因此不需要额外的细化步骤或子模型，从而使我们的模型以更少的参数非常快。数据集和代码可在此获得https://github.com/naver/garnet。

{"title":"The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis","authors":"Hyeonsu Lee, Chankyu Choi","doi":"10.48550/arXiv.2210.07489","DOIUrl":"https://doi.org/10.48550/arXiv.2210.07489","url":null,"abstract":"Scene text removal (STR), a task of erasing text from natural scene images, has recently attracted attention as an important component of editing text or concealing private information such as ID, telephone, and license plate numbers. While there are a variety of different methods for STR actively being researched, it is difficult to evaluate superiority because previously proposed methods do not use the same standardized training/evaluation dataset. We use the same standardized training/testing dataset to evaluate the performance of several previous methods after standardized re-implementation. We also introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper. GA uses attention to focus on the text stroke as well as the textures and colors of the surrounding regions to remove text from the input image much more precisely. RoIG is applied to focus on only the region with text instead of the entire image to train the model more efficiently. Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics with remarkably higher-quality results. Furthermore, because our model does not generate a text stroke mask explicitly, there is no need for additional refinement steps or sub-models, making our model extremely fast with fewer parameters. The dataset and code are available at this https://github.com/naver/garnet.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"31 1","pages":"457-472"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85301718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving 自动驾驶中运动启发的无监督感知和预测

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-14 DOI: 10.48550/arXiv.2210.08061

Mahyar Najibi, Jingwei Ji, Yin Zhou, C. Qi, Xinchen Yan, S. Ettinger, Drago Anguelov

Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.

现代自动驾驶系统中基于学习的感知和预测模块通常依赖于昂贵的人工注释，并且只能感知少数预定义的对象类别。这种闭集范式不足以满足安全关键型自动驾驶任务，因为自动驾驶车辆需要在高度动态的世界中处理任意多种类型的交通参与者及其运动行为。为了解决这一困难，本文开创了一个新颖而具有挑战性的方向，即训练感知和预测模型来理解开放集运动物体，而无需人工监督。我们提出的框架使用自学习流来触发自动元标记管道，以实现自动监督。在Waymo开放数据集上的3D检测实验表明，我们的方法明显优于经典的无监督方法，甚至可以与有监督的场景流相媲美。我们进一步表明，我们的方法在开放集3D检测和轨迹预测方面产生了非常有希望的结果，证实了它在缩小完全监督系统的安全差距方面的潜力。

{"title":"Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving","authors":"Mahyar Najibi, Jingwei Ji, Yin Zhou, C. Qi, Xinchen Yan, S. Ettinger, Drago Anguelov","doi":"10.48550/arXiv.2210.08061","DOIUrl":"https://doi.org/10.48550/arXiv.2210.08061","url":null,"abstract":"Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"102 1","pages":"424-443"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75645026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Real Spike: Learning Real-valued Spikes for Spiking Neural Networks 真实尖峰:学习尖峰神经网络的实值尖峰

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-10-13 DOI: 10.48550/arXiv.2210.06686

Yu-Zhu Guo, Liwen Zhang, Y. Chen, Xinyi Tong, Xiaode Liu, Yinglei Wang, Xuhui Huang, Zhe Ma

Brain-inspired spiking neural networks (SNNs) have recently drawn more and more attention due to their event-driven and energyefficient characteristics. The integration of storage and computation paradigm on neuromorphic hardwares makes SNNs much different from Deep Neural Networks (DNNs). In this paper, we argue that SNNs may not benefit from the weight-sharing mechanism, which can effectively reduce parameters and improve inference efficiency in DNNs, in some hardwares, and assume that an SNN with unshared convolution kernels could perform better. Motivated by this assumption, a training-inference decoupling method for SNNs named as Real Spike is proposed, which not only enjoys both unshared convolution kernels and binary spikes in inference-time but also maintains both shared convolution kernels and Real-valued Spikes during training. This decoupling mechanism of SNN is realized by a re-parameterization technique. Furthermore, based on the training-inference-decoupled idea, a series of different forms for implementing Real Spike on different levels are presented, which also enjoy shared convolutions in the inference and are friendly to both neuromorphic and non-neuromorphic hardware platforms. A theoretical proof is given to clarify that the Real Spike-based SNN network is superior to its vanilla counterpart. Experimental results show that all different Real Spike versions can consistently improve the SNN performance. Moreover, the proposed method outperforms the state-of-the-art models on both non-spiking static and neuromorphic datasets.

近年来，脑激发型脉冲神经网络(SNNs)因其事件驱动和能量高效的特点而受到越来越多的关注。神经形态硬件上存储和计算模式的集成使得snn与深度神经网络(Deep Neural Networks, dnn)有很大的不同。在本文中，我们认为SNN在某些硬件中可能无法从权值共享机制中获益，而权值共享机制可以有效地减少dnn中的参数并提高推理效率，并假设具有非共享卷积核的SNN可以表现得更好。基于这一假设，提出了一种snn的训练-推理解耦方法Real Spike，该方法不仅在推理时间内具有非共享卷积核和二值尖峰，而且在训练过程中保持共享卷积核和实值尖峰。SNN的这种解耦机制是通过一种重参数化技术实现的。此外，基于训练-推理-解耦的思想，提出了一系列在不同层次上实现Real Spike的不同形式，这些形式在推理中具有共享卷积，并且对神经形态和非神经形态硬件平台都友好。从理论上证明了基于Real spike的SNN网络优于普通SNN网络。实验结果表明，所有不同的Real Spike版本都可以一致地提高SNN性能。此外，该方法在非尖峰静态和神经形态数据集上都优于最先进的模型。

{"title":"Real Spike: Learning Real-valued Spikes for Spiking Neural Networks","authors":"Yu-Zhu Guo, Liwen Zhang, Y. Chen, Xinyi Tong, Xiaode Liu, Yinglei Wang, Xuhui Huang, Zhe Ma","doi":"10.48550/arXiv.2210.06686","DOIUrl":"https://doi.org/10.48550/arXiv.2210.06686","url":null,"abstract":"Brain-inspired spiking neural networks (SNNs) have recently drawn more and more attention due to their event-driven and energyefficient characteristics. The integration of storage and computation paradigm on neuromorphic hardwares makes SNNs much different from Deep Neural Networks (DNNs). In this paper, we argue that SNNs may not benefit from the weight-sharing mechanism, which can effectively reduce parameters and improve inference efficiency in DNNs, in some hardwares, and assume that an SNN with unshared convolution kernels could perform better. Motivated by this assumption, a training-inference decoupling method for SNNs named as Real Spike is proposed, which not only enjoys both unshared convolution kernels and binary spikes in inference-time but also maintains both shared convolution kernels and Real-valued Spikes during training. This decoupling mechanism of SNN is realized by a re-parameterization technique. Furthermore, based on the training-inference-decoupled idea, a series of different forms for implementing Real Spike on different levels are presented, which also enjoy shared convolutions in the inference and are friendly to both neuromorphic and non-neuromorphic hardware platforms. A theoretical proof is given to clarify that the Real Spike-based SNN network is superior to its vanilla counterpart. Experimental results show that all different Real Spike versions can consistently improve the SNN performance. Moreover, the proposed method outperforms the state-of-the-art models on both non-spiking static and neuromorphic datasets.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"108 1","pages":"52-68"},"PeriodicalIF":0.0,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86125613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12