
Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision: Latest Publications

CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection
Jyh-Jing Hwang, Henrik Kretzschmar, Joshua M. Manela, Sean M. Rafferty, N. Armstrong-Crews, Tiffany Chen, Drago Anguelov
Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.
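As a rough illustration of the ray-constrained cross-attention described above, the sketch below (PyTorch assumed) lets each camera query attend only to radar features sampled at candidate depths along its own viewing ray; the module name, tensor shapes, and single-head formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RayConstrainedCrossAttention(nn.Module):
    """Each camera query attends only to radar features sampled along its own viewing ray."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, cam_feat, ray_feat):
        # cam_feat: (B, N, C)    one feature per foreground camera pixel
        # ray_feat: (B, N, D, C) radar features sampled at D depth candidates along each pixel's ray
        q = self.q(cam_feat).unsqueeze(2)          # (B, N, 1, C)
        k = self.k(ray_feat)                       # (B, N, D, C)
        v = self.v(ray_feat)
        attn = (q * k).sum(-1) * self.scale        # (B, N, D): scores restricted to the ray
        w = attn.softmax(dim=-1)
        fused = (w.unsqueeze(-1) * v).sum(dim=2)   # (B, N, C)
        return fused, w                            # w reads as a soft depth distribution per pixel
```

A quick smoke test: `RayConstrainedCrossAttention(64)(torch.randn(2, 100, 64), torch.randn(2, 100, 16, 64))` returns a fused (2, 100, 64) tensor and (2, 100, 16) attention weights over the depth candidates.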
{"title":"CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection","authors":"Jyh-Jing Hwang, Henrik Kretzschmar, Joshua M. Manela, Sean M. Rafferty, N. Armstrong-Crews, Tiffany Chen, Drago Anguelov","doi":"10.48550/arXiv.2210.09267","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09267","url":null,"abstract":"Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"11 1","pages":"388-405"},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84939731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 21
Selective Query-Guided Debiasing for Video Corpus Moment Retrieval
Sunjae Yoon, Jiajing Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Changdong Yoo
{"title":"Selective Query-Guided Debiasing for Video Corpus Moment Retrieval","authors":"Sunjae Yoon, Jiajing Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Changdong Yoo","doi":"10.1007/978-3-031-20059-5_11","DOIUrl":"https://doi.org/10.1007/978-3-031-20059-5_11","url":null,"abstract":"","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"1 1","pages":"185-200"},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77177316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Distilling Object Detectors With Global Knowledge
Sanli Tang, Zhongyu Zhang, Zhanzhan Cheng, Jing Lu, Yunlu Xu, Yi Niu, Fan He
Knowledge distillation learns a lightweight student model that mimics a cumbersome teacher. Existing methods regard the knowledge as the feature of each instance or their relations, which is the instance-level knowledge only from the teacher model, i.e., the local knowledge. However, empirical studies show that the local knowledge is quite noisy in object detection tasks, especially on blurred, occluded, or small instances. Thus, a more intrinsic approach is to measure the representations of instances w.r.t. a group of common basis vectors in the two feature spaces of the teacher and the student detectors, i.e., global knowledge, so that distillation can be applied as space alignment. To this end, a novel prototype generation module (PGM) is proposed to find the common basis vectors, dubbed prototypes, in the two feature spaces. Then, a robust distilling module (RDM) is applied to (1) construct the global knowledge by projecting the instances w.r.t. the prototypes, and (2) robustly distill the global and local knowledge by measuring the discrepancy of the representations in the two feature spaces, filtering out noisy local knowledge. Experiments with Faster-RCNN and RetinaNet on the PASCAL and COCO datasets show that our method achieves the best performance for distilling object detectors with various backbones, even surpassing the performance of the teacher model. We also show that the existing methods can be easily combined with global knowledge, and scaled to larger teachers, to obtain further improvement. Code is available: https://github.com/hikvision-research/DAVAR-Lab-ML.
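To make the prototype idea concrete, here is a hedged sketch (PyTorch assumed) of a global-knowledge distillation loss: instance features from teacher and student are projected onto their respective prototype sets, and instances whose projected representations disagree strongly are down-weighted as noisy local knowledge. The function name, normalization, and weighting scheme are illustrative assumptions; the paper's PGM/RDM modules are more involved.

```python
import torch
import torch.nn.functional as F

def global_distillation_loss(student_feat, teacher_feat, student_protos, teacher_protos):
    # student_feat/teacher_feat: (N, C) per-instance features from the two detectors
    # student_protos/teacher_protos: (K, C) common basis vectors ("prototypes") of each feature space
    s_repr = F.normalize(student_feat, dim=-1) @ F.normalize(student_protos, dim=-1).t()  # (N, K)
    t_repr = F.normalize(teacher_feat, dim=-1) @ F.normalize(teacher_protos, dim=-1).t()  # (N, K)
    # instances whose global representations disagree strongly are treated as noisy and down-weighted
    discrepancy = (s_repr - t_repr).abs().mean(dim=-1)   # (N,)
    weights = torch.exp(-discrepancy).detach()
    return (weights * (s_repr - t_repr).pow(2).mean(dim=-1)).mean()
```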
{"title":"Distilling Object Detectors With Global Knowledge","authors":"Sanli Tang, Zhongyu Zhang, Zhanzhan Cheng, Jing Lu, Yunlu Xu, Yi Niu, Fan He","doi":"10.48550/arXiv.2210.09022","DOIUrl":"https://doi.org/10.48550/arXiv.2210.09022","url":null,"abstract":". Knowledge distillation learns a lightweight student model that mimics a cumbersome teacher. Existing methods regard the knowledge as the feature of each instance or their relations, which is the instance-level knowledge only from the teacher model, i.e., the local knowledge. However, the empirical studies show that the local knowledge is much noisy in object detection tasks, especially on the blurred, occluded, or small instances. Thus, a more intrinsic approach is to measure the representations of instances w.r.t. a group of common basis vectors in the two feature spaces of the teacher and the student detectors, i.e., global knowledge. Then, the distilling algorithm can be applied as space alignment. To this end, a novel prototype generation module (PGM) is proposed to find the common basis vectors, dubbed prototypes , in the two feature spaces. Then, a robust distilling module (RDM) is applied to construct the global knowledge based on the prototypes and filtrate noisy local knowledge by measuring the discrepancy of the representations in two feature spaces. Experiments with Faster-RCNN and RetinaNet on PASCAL and COCO datasets show that our method achieves the best performance for distilling object detectors with various backbones, which even surpasses the performance of the teacher model. We also show that the existing methods can be easily combined with global knowledge and obtain further improvement. Code is available: https://github.com/hikvision-research/DAVAR-Lab-ML . to (1) construct the global knowledge by projecting the instances w.r.t. the prototypes, and (2) robustly distill the global and local knowledge by measuring their discrepancy in the two spaces. Experiments show that the proposed method achieves state-of-the-art performance on two popular detection frameworks and benchmarks. The extensive experimental results show that the proposed method can be easily stretched with larger teachers and the existing knowledge distillation methods to obtain further improvement.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"18 1","pages":"422-438"},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81107469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Geometric Representation Learning for Document Image Rectification
Hao Feng, Wen-gang Zhou, Jiajun Deng, Yuechen Wang, Houqiang Li
In document image rectification, there exist rich geometric constraints between the distorted image and the ground truth one. However, such geometric constraints are largely ignored in existing advanced solutions, which limits the rectification performance. To this end, we present DocGeoNet for document image rectification by introducing explicit geometric representation. Technically, two typical attributes of the document image are involved in the proposed geometric representation learning, i.e., 3D shape and textlines. Our motivation arises from the insight that 3D shape provides global unwarping cues for rectifying a distorted document image while overlooking the local structure. On the other hand, textlines complementarily provide explicit geometric constraints for local patterns. The learned geometric representation effectively bridges the distorted image and the ground truth one. Extensive experiments show the effectiveness of our framework and demonstrate the superiority of our DocGeoNet over state-of-the-art methods on both the DocUNet Benchmark dataset and our proposed DIR300 test set. The code is available at https://github.com/fh2019ustc/DocGeoNet.
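A minimal sketch, assuming PyTorch, of the two-attribute geometric representation: one head regresses a per-pixel 3D shape map (the global unwarping cue), another a textline map (the local constraint), and both are fused back into the image features that drive rectification. Layer shapes, names, and the fusion scheme are assumptions, not DocGeoNet's actual architecture.

```python
import torch
import torch.nn as nn

class GeometryHeads(nn.Module):
    """Predicts a 3D-shape map and a textline map, then fuses both with the image features."""

    def __init__(self, in_ch=256):
        super().__init__()
        self.shape_head = nn.Conv2d(in_ch, 3, kernel_size=3, padding=1)     # per-pixel 3D coordinates
        self.textline_head = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)  # per-pixel textline logit
        self.fuse = nn.Conv2d(in_ch + 4, in_ch, kernel_size=1)

    def forward(self, feat):                       # feat: (B, in_ch, H, W)
        shape = self.shape_head(feat)
        textline = self.textline_head(feat)
        fused = self.fuse(torch.cat([feat, shape, textline], dim=1))
        return fused, shape, textline              # fused features feed the warping-flow decoder
```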
{"title":"Geometric Representation Learning for Document Image Rectification","authors":"Hao Feng, Wen-gang Zhou, Jiajun Deng, Yuechen Wang, Houqiang Li","doi":"10.48550/arXiv.2210.08161","DOIUrl":"https://doi.org/10.48550/arXiv.2210.08161","url":null,"abstract":"In document image rectification, there exist rich geometric constraints between the distorted image and the ground truth one. However, such geometric constraints are largely ignored in existing advanced solutions, which limits the rectification performance. To this end, we present DocGeoNet for document image rectification by introducing explicit geometric representation. Technically, two typical attributes of the document image are involved in the proposed geometric representation learning, i.e., 3D shape and textlines. Our motivation arises from the insight that 3D shape provides global unwarping cues for rectifying a distorted document image while overlooking the local structure. On the other hand, textlines complementarily provide explicit geometric constraints for local patterns. The learned geometric representation effectively bridges the distorted image and the ground truth one. Extensive experiments show the effectiveness of our framework and demonstrate the superiority of our DocGeoNet over state-of-the-art methods on both the DocUNet Benchmark dataset and our proposed DIR300 test set. The code is available at https://github.com/fh2019ustc/DocGeoNet.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"607 1","pages":"475-492"},"PeriodicalIF":0.0,"publicationDate":"2022-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78955782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds
Minghua Liu, Yin Zhou, C. Qi, Boqing Gong, Hao Su, Drago Anguelov
Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.
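The multi-scan distillation step can be pictured with a standard temperature-scaled KL objective, sketched below under the assumption that the teacher's per-point logits (computed from temporally aggregated scans) have been aligned to the student's single-scan points; the function name and temperature are illustrative, not the paper's exact formulation.

```python
import torch.nn.functional as F

def multi_scan_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # student_logits: (N, num_classes) per-point logits from the single-scan model
    # teacher_logits: (N, num_classes) logits from a model fed temporally aggregated scans,
    #                 assumed to be aligned to the same N points
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```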
{"title":"LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds","authors":"Minghua Liu, Yin Zhou, C. Qi, Boqing Gong, Hao Su, Drago Anguelov","doi":"10.48550/arXiv.2210.08064","DOIUrl":"https://doi.org/10.48550/arXiv.2210.08064","url":null,"abstract":"Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbones. Specifically, we leverage geometry patterns in outdoor scenes to have a heuristic pre-segmentation to reduce the manual labeling and jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"75 1","pages":"70-89"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73860713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis
Hyeonsu Lee, Chankyu Choi
Scene text removal (STR), a task of erasing text from natural scene images, has recently attracted attention as an important component of editing text or concealing private information such as ID, telephone, and license plate numbers. While there are a variety of different methods for STR actively being researched, it is difficult to evaluate superiority because previously proposed methods do not use the same standardized training/evaluation dataset. We use the same standardized training/testing dataset to evaluate the performance of several previous methods after standardized re-implementation. We also introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper. GA uses attention to focus on the text stroke as well as the textures and colors of the surrounding regions to remove text from the input image much more precisely. RoIG is applied to focus on only the region with text instead of the entire image to train the model more efficiently. Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics with remarkably higher-quality results. Furthermore, because our model does not generate a text stroke mask explicitly, there is no need for additional refinement steps or sub-models, making our model extremely fast with fewer parameters. The dataset and code are available at https://github.com/naver/garnet.
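A hedged sketch of the gated-attention idea, assuming PyTorch: two sigmoid gates produce spatial attention maps, one for the text stroke and one for its surrounding textures and colors, and the gated features are fused before decoding. Channel counts and the 1x1 fusion are assumptions for illustration, not the released GaRNet code.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Sigmoid-gated spatial attention over the text stroke and its surrounding region."""

    def __init__(self, channels):
        super().__init__()
        self.stroke_gate = nn.Conv2d(channels, 1, kernel_size=1)
        self.surround_gate = nn.Conv2d(channels, 1, kernel_size=1)
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, feat):                                      # feat: (B, C, H, W)
        stroke_attn = torch.sigmoid(self.stroke_gate(feat))       # where the text stroke is
        surround_attn = torch.sigmoid(self.surround_gate(feat))   # textures/colors to borrow from
        gated = torch.cat([feat * stroke_attn, feat * surround_attn], dim=1)
        return self.fuse(gated)
```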
{"title":"The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis","authors":"Hyeonsu Lee, Chankyu Choi","doi":"10.48550/arXiv.2210.07489","DOIUrl":"https://doi.org/10.48550/arXiv.2210.07489","url":null,"abstract":"Scene text removal (STR), a task of erasing text from natural scene images, has recently attracted attention as an important component of editing text or concealing private information such as ID, telephone, and license plate numbers. While there are a variety of different methods for STR actively being researched, it is difficult to evaluate superiority because previously proposed methods do not use the same standardized training/evaluation dataset. We use the same standardized training/testing dataset to evaluate the performance of several previous methods after standardized re-implementation. We also introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper. GA uses attention to focus on the text stroke as well as the textures and colors of the surrounding regions to remove text from the input image much more precisely. RoIG is applied to focus on only the region with text instead of the entire image to train the model more efficiently. Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics with remarkably higher-quality results. Furthermore, because our model does not generate a text stroke mask explicitly, there is no need for additional refinement steps or sub-models, making our model extremely fast with fewer parameters. The dataset and code are available at this https://github.com/naver/garnet.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"31 1","pages":"457-472"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85301718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving
Mahyar Najibi, Jingwei Ji, Yin Zhou, C. Qi, Xinchen Yan, S. Ettinger, Drago Anguelov
Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.
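One way to picture the automated meta-labeling pipeline is the simplified sketch below (NumPy and scikit-learn assumed): points whose self-learned scene flow exceeds a threshold are clustered, and each cluster becomes a pseudo box label. The thresholds, DBSCAN clustering, and axis-aligned boxes are stand-ins for the paper's more elaborate flow estimation, tracking, and box fitting.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def pseudo_label_moving_objects(points, flow, flow_thresh=0.5, eps=0.8, min_samples=10):
    # points: (N, 3) LiDAR points; flow: (N, 3) self-learned scene flow per point
    moving = np.linalg.norm(flow, axis=1) > flow_thresh            # keep only clearly moving points
    labels = np.full(len(points), -1, dtype=int)
    if moving.any():
        labels[moving] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[moving])
    boxes = []
    for cid in np.unique(labels[labels >= 0]):                     # one pseudo label per cluster
        cluster = points[labels == cid]
        boxes.append(np.concatenate([cluster.min(axis=0), cluster.max(axis=0)]))  # axis-aligned box
    return boxes
```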
{"title":"Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving","authors":"Mahyar Najibi, Jingwei Ji, Yin Zhou, C. Qi, Xinchen Yan, S. Ettinger, Drago Anguelov","doi":"10.48550/arXiv.2210.08061","DOIUrl":"https://doi.org/10.48550/arXiv.2210.08061","url":null,"abstract":"Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive to the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"102 1","pages":"424-443"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75645026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
Real Spike: Learning Real-valued Spikes for Spiking Neural Networks
Yu-Zhu Guo, Liwen Zhang, Y. Chen, Xinyi Tong, Xiaode Liu, Yinglei Wang, Xuhui Huang, Zhe Ma
Brain-inspired spiking neural networks (SNNs) have recently drawn more and more attention due to their event-driven and energy-efficient characteristics. The integration of the storage and computation paradigm on neuromorphic hardware makes SNNs much different from Deep Neural Networks (DNNs). In this paper, we argue that on some hardware SNNs may not benefit from the weight-sharing mechanism, which can effectively reduce parameters and improve inference efficiency in DNNs, and assume that an SNN with unshared convolution kernels could perform better. Motivated by this assumption, a training-inference decoupling method for SNNs named Real Spike is proposed, which not only enjoys both unshared convolution kernels and binary spikes at inference time but also maintains both shared convolution kernels and real-valued spikes during training. This decoupling mechanism of the SNN is realized by a re-parameterization technique. Furthermore, based on the training-inference-decoupled idea, a series of different forms for implementing Real Spike on different levels are presented, which also enjoy shared convolutions in inference and are friendly to both neuromorphic and non-neuromorphic hardware platforms. A theoretical proof is given to clarify that the Real Spike-based SNN network is superior to its vanilla counterpart. Experimental results show that all different Real Spike versions can consistently improve the SNN performance. Moreover, the proposed method outperforms the state-of-the-art models on both non-spiking static and neuromorphic datasets.
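The training-inference decoupling rests on a re-parameterization that folds learned real-valued spike amplitudes into the convolution kernels, so inference sees only binary spikes. The sketch below shows the simpler per-channel folding in PyTorch; the paper's unshared (per-location) kernels generalize this idea, so treat the code as an assumption-laden illustration rather than the authors' method.

```python
import torch
import torch.nn as nn

def fold_spike_scale_into_conv(conv: nn.Conv2d, spike_scale: torch.Tensor) -> nn.Conv2d:
    # spike_scale: (in_channels,) real-valued spike amplitudes learned during training.
    # Training computes conv(scale * binary_spikes, W); folding the scale into the kernel,
    # W'[:, c] = W[:, c] * scale[c], lets inference run on binary {0, 1} spikes only.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * spike_scale.view(1, -1, 1, 1))
        if conv.bias is not None:
            fused.bias.copy_(conv.bias)
    return fused
```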
{"title":"Real Spike: Learning Real-valued Spikes for Spiking Neural Networks","authors":"Yu-Zhu Guo, Liwen Zhang, Y. Chen, Xinyi Tong, Xiaode Liu, Yinglei Wang, Xuhui Huang, Zhe Ma","doi":"10.48550/arXiv.2210.06686","DOIUrl":"https://doi.org/10.48550/arXiv.2210.06686","url":null,"abstract":"Brain-inspired spiking neural networks (SNNs) have recently drawn more and more attention due to their event-driven and energyefficient characteristics. The integration of storage and computation paradigm on neuromorphic hardwares makes SNNs much different from Deep Neural Networks (DNNs). In this paper, we argue that SNNs may not benefit from the weight-sharing mechanism, which can effectively reduce parameters and improve inference efficiency in DNNs, in some hardwares, and assume that an SNN with unshared convolution kernels could perform better. Motivated by this assumption, a training-inference decoupling method for SNNs named as Real Spike is proposed, which not only enjoys both unshared convolution kernels and binary spikes in inference-time but also maintains both shared convolution kernels and Real-valued Spikes during training. This decoupling mechanism of SNN is realized by a re-parameterization technique. Furthermore, based on the training-inference-decoupled idea, a series of different forms for implementing Real Spike on different levels are presented, which also enjoy shared convolutions in the inference and are friendly to both neuromorphic and non-neuromorphic hardware platforms. A theoretical proof is given to clarify that the Real Spike-based SNN network is superior to its vanilla counterpart. Experimental results show that all different Real Spike versions can consistently improve the SNN performance. Moreover, the proposed method outperforms the state-of-the-art models on both non-spiking static and neuromorphic datasets.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"108 1","pages":"52-68"},"PeriodicalIF":0.0,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86125613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction
Yuxuan Liu, Nikhil Mishra, Maximilian Sieb, Yide Shentu, P. Abbeel, Xi Chen
3D bounding boxes are a widespread intermediate representation in many computer vision applications. However, predicting them is a challenging task, largely due to partial observability, which motivates the need for a strong sense of uncertainty. While many recent methods have explored better architectures for consuming sparse and unstructured point cloud data, we hypothesize that there is room for improvement in the modeling of the output distribution and explore how this can be achieved using an autoregressive prediction head. Additionally, we release a simulated dataset, COB-3D, which highlights new types of ambiguity that arise in real-world robotics applications, where 3D bounding box prediction has largely been underexplored. We propose methods for leveraging our autoregressive model to make high confidence predictions and meaningful uncertainty measures, achieving strong results on SUN-RGBD, Scannet, KITTI, and our new dataset.
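An autoregressive prediction head can be sketched as below (PyTorch assumed): each box parameter is predicted as a categorical distribution over bins, conditioned on the parameters already predicted. The bin range, the use of expected values for conditioning, and the parameter ordering are illustrative assumptions rather than the paper's exact head.

```python
import torch
import torch.nn as nn

class AutoregressiveBoxHead(nn.Module):
    """Predicts each box parameter as a distribution over bins, conditioned on the previous parameters."""

    def __init__(self, feat_dim=256, num_bins=64, num_params=7):   # 7 params: x, y, z, l, w, h, yaw
        super().__init__()
        self.num_bins = num_bins
        self.heads = nn.ModuleList(nn.Linear(feat_dim + i, num_bins) for i in range(num_params))

    def forward(self, feat):                       # feat: (B, feat_dim) per-object feature
        prev, logits = [], []
        for head in self.heads:
            l = head(torch.cat([feat] + prev, dim=-1))             # (B, num_bins)
            logits.append(l)
            bins = torch.linspace(-1.0, 1.0, self.num_bins, device=feat.device)
            prev.append((l.softmax(dim=-1) * bins).sum(dim=-1, keepdim=True))  # condition on the expectation
        return logits                              # per-parameter bin logits; their softmax carries the uncertainty
```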
{"title":"Autoregressive Uncertainty Modeling for 3D Bounding Box Prediction","authors":"Yuxuan Liu, Nikhil Mishra, Maximilian Sieb, Yide Shentu, P. Abbeel, Xi Chen","doi":"10.48550/arXiv.2210.07424","DOIUrl":"https://doi.org/10.48550/arXiv.2210.07424","url":null,"abstract":". 3D bounding boxes are a widespread intermediate representation in many computer vision applications. However, predicting them is a challenging task, largely due to partial observability, which motivates the need for a strong sense of uncertainty. While many recent methods have explored better architectures for consuming sparse and unstructured point cloud data, we hypothesize that there is room for improve-ment in the modeling of the output distribution and explore how this can be achieved using an autoregressive prediction head. Additionally, we release a simulated dataset, COB-3D, which highlights new types of ambiguity that arise in real-world robotics applications, where 3D bounding box prediction has largely been underexplored. We propose methods for leveraging our autoregressive model to make high confidence predictions and meaningful uncertainty measures, achieving strong results on SUN-RGBD, Scannet, KITTI, and our new dataset 3 .","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"33 1","pages":"673-694"},"PeriodicalIF":0.0,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91030375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds
Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, Drago Anguelov
3D object detection in point clouds is a core component for modern robotics and autonomous driving systems. A key challenge in 3D object detection comes from the inherent sparse nature of point occupancy within the 3D scene. In this paper, we propose Sparse Window Transformer (SWFormer ), a scalable and accurate model for 3D object detection, which can take full advantage of the sparsity of point clouds. Built upon the idea of window-based Transformers, SWFormer converts 3D points into sparse voxels and windows, and then processes these variable-length sparse windows efficiently using a bucketing scheme. In addition to self-attention within each spatial window, our SWFormer also captures cross-window correlation with multi-scale feature fusion and window shifting operations. To further address the unique challenge of detecting 3D objects accurately from sparse features, we propose a new voxel diffusion technique. Experimental results on the Waymo Open Dataset show our SWFormer achieves state-of-the-art 73.36 L2 mAPH on vehicle and pedestrian for 3D object detection on the official test set, outperforming all previous single-stage and two-stage models, while being much more efficient.
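The bucketing scheme for variable-length sparse windows can be pictured with the sketch below (PyTorch assumed): non-empty voxels are grouped by window, and each window is routed to the smallest bucket size that fits it, so windows of similar length are padded and batched together for self-attention instead of padding everything to the global maximum. Bucket sizes and the handling of oversized windows are assumptions, not SWFormer's implementation.

```python
import torch

def bucket_sparse_windows(window_ids, bucket_sizes=(16, 64, 256)):
    # window_ids: (N,) window index of every non-empty voxel.
    uniq, counts = torch.unique(window_ids, return_counts=True)
    buckets = {size: [] for size in bucket_sizes}
    for wid, n in zip(uniq.tolist(), counts.tolist()):
        for size in bucket_sizes:
            if n <= size:
                buckets[size].append(wid)          # smallest bucket that fits this window
                break
        else:
            buckets[bucket_sizes[-1]].append(wid)  # oversized windows; a real implementation would split them
    return buckets
```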
{"title":"SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds","authors":"Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, Drago Anguelov","doi":"10.48550/arXiv.2210.07372","DOIUrl":"https://doi.org/10.48550/arXiv.2210.07372","url":null,"abstract":"3D object detection in point clouds is a core component for modern robotics and autonomous driving systems. A key challenge in 3D object detection comes from the inherent sparse nature of point occupancy within the 3D scene. In this paper, we propose Sparse Window Transformer (SWFormer ), a scalable and accurate model for 3D object detection, which can take full advantage of the sparsity of point clouds. Built upon the idea of window-based Transformers, SWFormer converts 3D points into sparse voxels and windows, and then processes these variable-length sparse windows efficiently using a bucketing scheme. In addition to self-attention within each spatial window, our SWFormer also captures cross-window correlation with multi-scale feature fusion and window shifting operations. To further address the unique challenge of detecting 3D objects accurately from sparse features, we propose a new voxel diffusion technique. Experimental results on the Waymo Open Dataset show our SWFormer achieves state-of-the-art 73.36 L2 mAPH on vehicle and pedestrian for 3D object detection on the official test set, outperforming all previous single-stage and two-stage models, while being much more efficient.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"119 17","pages":"426-442"},"PeriodicalIF":0.0,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91408690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 39