
2022 19th Conference on Robots and Vision (CRV): Latest Publications

Occlusion-Aware Self-Supervised Stereo Matching with Confidence Guided Raw Disparity Fusion
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00025
Xiule Fan, Soo Jeon, B. Fidan
Commercially available stereo cameras used in robots and other intelligent systems to obtain depth information typically rely on traditional stereo matching algorithms. Although their raw (predicted) disparity maps contain incorrect estimates, these algorithms can still provide useful prior information towards more accurate prediction. We propose a pipeline that incorporates this prior information to produce more accurate disparity maps. The proposed pipeline includes a confidence generation component to identify raw disparity inaccuracies, as well as a self-supervised deep neural network (DNN) to predict disparity and compute the corresponding occlusion masks. The proposed DNN consists of a feature extraction module, a confidence-guided raw disparity fusion module to generate an initial disparity map, and a hierarchical occlusion-aware disparity refinement module to compute the final estimates. Experimental results on public datasets verify that the proposed pipeline achieves competitive accuracy at a real-time processing rate. We also test the pipeline with images captured by commercial stereo cameras to show its effectiveness in improving their raw disparity estimates.
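A minimal sketch of the confidence-guided fusion idea described in this abstract, written in PyTorch: the raw disparity from a classical matcher and a learned disparity estimate are blended per pixel by a confidence map, then refined with a small convolutional head. The module name, channel sizes, and residual refinement head below are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ConfidenceGuidedFusion(nn.Module):
    """Blend raw and predicted disparity by confidence, then refine (illustrative sketch)."""
    def __init__(self, feat_channels: int = 32):
        super().__init__()
        # Small residual head conditioned on image features (assumed design).
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + 2, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, disp_raw, disp_pred, confidence, features):
        # Per-pixel confidence-weighted blend of raw and learned disparities.
        disp_init = confidence * disp_raw + (1.0 - confidence) * disp_pred
        x = torch.cat([disp_init, confidence, features], dim=1)
        return disp_init + self.refine(x)

# Dummy usage with (N, C, H, W) tensors.
fusion = ConfidenceGuidedFusion(feat_channels=32)
disp = fusion(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64),
              torch.rand(1, 1, 64, 64), torch.rand(1, 32, 64, 64))
```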
Citations: 1
Classification of handwritten annotations in mixed-media documents
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00027
Amanda Dash, A. Albu
Handwritten annotations in documents contain valuable information, but they are challenging to detect and identify. This paper addresses this challenge. We propose an algorithm for generating a novel mixed-media document dataset, Annotated Docset, that consists of 14 classes of machine-printed and handwritten elements and annotations. We also propose a novel loss function, Dense Loss, which can correctly identify small objects in complex documents when used in fully convolutional networks (e.g., U-Net, DeepLabV3+). Our Dense Loss function is a compound function that uses local region homogeneity to promote contiguous and smooth segmentation predictions, while also using an L1-norm loss to reconstruct the dense-labelled ground truth. By using regression instead of a probabilistic approach to pixel classification, we avoid the pitfalls of training on datasets with small or underrepresented objects. We show that our loss function outperforms other semantic segmentation loss functions for imbalanced datasets containing few elements that occupy small areas. Experimental results show that the proposed method achieved a mean Intersection-over-Union (mIoU) score of 0.7163 for all document classes and 0.6290 for handwritten annotations, thus outperforming state-of-the-art loss functions.
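One plausible form of such a compound regression loss, sketched below for reference: an L1 term that reconstructs the dense label map plus a homogeneity term that penalizes prediction gradients where the target is locally flat. The specific homogeneity measure and weighting are assumptions and may differ from the paper's exact Dense Loss formulation.

```python
import torch
import torch.nn.functional as F

def dense_loss(pred, target, smooth_weight=0.1):
    """L1 reconstruction of the dense label map plus a local homogeneity penalty (sketch)."""
    # pred, target: (N, 1, H, W) dense label maps treated as regression targets.
    l1 = F.l1_loss(pred, target)
    # Penalize prediction gradients mainly where the target is locally homogeneous.
    dx_p = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dy_p = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    dx_t = (target[..., :, 1:] - target[..., :, :-1]).abs()
    dy_t = (target[..., 1:, :] - target[..., :-1, :]).abs()
    smooth = (dx_p * torch.exp(-dx_t)).mean() + (dy_p * torch.exp(-dy_t)).mean()
    return l1 + smooth_weight * smooth

loss = dense_loss(torch.rand(2, 1, 64, 64), torch.randint(0, 14, (2, 1, 64, 64)).float())
```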
Citations: 0
Proceedings 2022 19th Conference on Robots and Vision
Pub Date : 2022-05-01 DOI: 10.1109/crv55824.2022.00001
{"title":"Proceedings 2022 19th Conference on Robots and Vision","authors":"","doi":"10.1109/crv55824.2022.00001","DOIUrl":"https://doi.org/10.1109/crv55824.2022.00001","url":null,"abstract":"","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133557166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Occluded Text Detection and Recognition in the Wild
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00026
Z. Raisi, J. Zelek
Existing deep-learning-based scene text recognition methods degrade significantly on occluded text instances, or even on partially occluded characters within a text, because they rely on the visibility of the target characters in images. This failure is often due to the limited robustness to occlusion of the features generated by current architectures, which opens the possibility of improving the feature extractors and/or the learning models to better handle such severe occlusions. In this paper, we first evaluate the performance of current scene text detection, scene text recognition, and scene text spotting models using two publicly available occlusion datasets: Occlusion Scene Text (OST), which is designed explicitly for scene text recognition, and an Occluded Character-level Total-Text (OCTT) dataset that we prepare for evaluating scene text spotting and detection models. Then we utilize a recent Transformer-based framework in deep learning, namely the Masked Autoencoder (MAE), as a backbone for scene text detection and recognition pipelines to mitigate the occlusion problem. The performance of our scene text recognition and end-to-end scene text spotting models improves through transfer learning on the pre-trained MAE backbone. For example, our recognition model improves by 4% in word recognition accuracy on the OST dataset. Our end-to-end text spotting model achieves an F-measure of 68.5% on the OCTT dataset when equipped with an MAE backbone instead of a convolutional neural network (CNN) backbone, outperforming state-of-the-art methods.
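A compact sketch of the transfer-learning pattern this abstract describes: a pre-trained MAE-style ViT encoder supplies patch tokens, and a lightweight attention-based character decoder is fine-tuned on top of it. The head design, token shapes, and character-set size below are illustrative assumptions; the stand-in tokens replace a real encoder loaded from a checkpoint.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Attention-based character decoder fine-tuned on top of a pre-trained MAE encoder (sketch)."""
    def __init__(self, embed_dim=768, num_chars=97, max_len=25):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_len, embed_dim))   # one query per output character
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_chars)

    def forward(self, patch_tokens):                 # (N, num_patches, embed_dim) from the encoder
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        dec, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.classifier(dec)                  # (N, max_len, num_chars) character logits

# Stand-in for tokens produced by a pre-trained MAE ViT encoder (14x14 patches of a text crop).
head = RecognitionHead()
tokens = torch.randn(2, 196, 768)
logits = head(tokens)   # (2, 25, 97)
# In practice the encoder and head would be optimized jointly, typically with a smaller
# learning rate on the pre-trained encoder parameters than on the new head.
```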
Citations: 1
Instance Segmentation of Herring and Salmon Schools in Acoustic Echograms using a Hybrid U-Net
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00010
Alex L. Slonimer, Melissa Cote, T. Marques, A. Rezvanifar, S. Dosso, A. Albu, Kaan Ersahin, T. Mudge, S. Gauthier
The automated classification of fish, such as herring and salmon, in multi-frequency echograms is important for ecosystem monitoring. This paper implements a novel approach to instance segmentation: a hybrid of deep-learning and heuristic methods. The approach performs semantic segmentation with a U-Net to detect fish, and the detections are converted into fish-school instances by linking candidate components that fall within a defined linking distance. In addition to four frequency channels of echogram data (67.5, 125, 200, 455 kHz), two simulated channels (water depth and solar elevation angle) are included to encode spatial and temporal information, which leads to a substantial improvement in model performance. The model is shown to outperform recent experiments that have used a Mask R-CNN architecture. This approach demonstrates the ability to classify sparsely distributed objects in a way that is not possible with state-of-the-art instance segmentation methods.
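A small sketch of the heuristic grouping step: connected components of the U-Net's fish mask are merged into school instances when they fall within the linking distance. The centroid-to-centroid distance and the union-find merging used here are assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy import ndimage

def group_schools(fish_mask: np.ndarray, link_dist: float) -> np.ndarray:
    """Merge connected components of a binary fish mask into school instances (sketch)."""
    labels, n = ndimage.label(fish_mask)
    if n == 0:
        return labels
    # Component centroids in pixel coordinates (row, col).
    centroids = np.array(ndimage.center_of_mass(fish_mask, labels, range(1, n + 1)))

    # Union-find over components whose centroids lie within the linking distance.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centroids[i] - centroids[j]) <= link_dist:
                parent[find(i)] = find(j)

    # Relabel every component with the id of its school (its union-find root).
    out = np.zeros_like(labels)
    for comp in range(1, n + 1):
        out[labels == comp] = find(comp - 1) + 1
    return out

schools = group_schools(np.random.rand(128, 256) > 0.98, link_dist=10.0)
```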
Citations: 1
Program Committee: CRV 2022
Pub Date : 2022-05-01 DOI: 10.1109/crv55824.2022.00007
{"title":"Program Committee: CRV 2022","authors":"","doi":"10.1109/crv55824.2022.00007","DOIUrl":"https://doi.org/10.1109/crv55824.2022.00007","url":null,"abstract":"","PeriodicalId":131142,"journal":{"name":"2022 19th Conference on Robots and Vision (CRV)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133879922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Lasso Method for Multi-Robot Foraging
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00022
A. Vardy
We propose a novel approach to multi-robot foraging. This approach makes use of a scalar field to guide robots throughout an environment while gathering objects towards the goal. The environment must be planar with a closed, contiguous boundary. However, the boundary's shape can be arbitrary. Conventional robot foraging methods assume an open environment or a simple boundary that never impedes the robots—a limitation which our method overcomes. Our distributed control algorithm causes the robots to circumnavigate the environment and nudge objects inwards towards the goal. We demonstrate the performance of our approach using real-world and simulated experiments and study the impact of the number of robots, the complexity of the boundary, and limitations on the sensing range.
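As a loose, toy illustration only: one way to realize the circumnavigate-and-nudge behaviour is to have each robot track a level set of a distance-to-goal scalar field while drifting tangentially around it, slowly tightening the tracked level so objects encountered along the way are pushed inward. The Euclidean field, control gains, and shrink schedule below are all assumptions; the paper's distributed control law and boundary handling are not reproduced here.

```python
import numpy as np

def control_step(pos, goal, level, k_radial=1.0, k_tangent=0.5, shrink=0.01):
    """One toy control update: hold a distance level around the goal while circulating (sketch)."""
    r = pos - goal
    dist = np.linalg.norm(r) + 1e-9
    radial = r / dist                              # gradient direction of the distance field
    tangent = np.array([-radial[1], radial[0]])    # circulate around the goal
    vel = k_radial * (level - dist) * radial + k_tangent * tangent
    return pos + vel, max(level - shrink, 0.0)     # slowly tighten the "lasso"

pos, level = np.array([4.0, 1.0]), 4.0
for _ in range(200):
    pos, level = control_step(pos, np.zeros(2), level)
```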
Citations: 2
An Exact Fast Fourier Method for Morphological Dilation and Erosion Using the Umbra Technique
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00032
V. Sridhar, M. Breuß
In this paper we consider dilation and erosion, the fundamental operations of mathematical morphology. It is well known that many powerful image filtering operations can be constructed from their combinations. We propose a fast and novel algorithm based on the Fast Fourier Transform to compute grey-value morphological operations on an image. In contrast to many other fast methods in the field, the novel method can deal with non-flat filters and imposes no restrictions on the shape and size of the filtering window. Unlike fast Fourier techniques from previous works, the novel method gives exact results and is not an approximation. The key to achieving this is to employ, for the first time in this context, the umbra formulation of images and filters. We show that the new method is in practice particularly suitable for filtering images with a small tonal range or when employing large filters.
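To make the umbra idea concrete, here is a brief sketch (not the authors' optimized algorithm) of exact grey-value dilation via umbras and FFT-based convolution: the image and the non-flat structuring function are lifted to binary volumes over grey levels, those volumes are dilated by thresholding their FFT convolution, and the top surface of the result is the dilated image. Integer, non-negative grey values and a centred structuring-element origin are assumed.

```python
import numpy as np
from scipy.signal import fftconvolve

def dilate_umbra(image: np.ndarray, se: np.ndarray) -> np.ndarray:
    """Exact grey-value dilation of `image` by the non-flat structuring function `se` (sketch)."""
    L, M = int(image.max()), int(se.max())
    # Umbras: binary volumes over (row, col, grey level), truncated at level 0.
    levels_f = np.arange(L + 1)
    levels_b = np.arange(M + 1)
    uf = (levels_f[None, None, :] <= image[..., None]).astype(np.float32)
    ub = (levels_b[None, None, :] <= se[..., None]).astype(np.float32)
    # Binary dilation of the umbras, computed as a 3-D FFT convolution plus a threshold.
    conv = fftconvolve(uf, ub, mode="full") > 0.5
    # Crop the spatial axes back to the image size (centred structuring element).
    r0, c0 = se.shape[0] // 2, se.shape[1] // 2
    conv = conv[r0:r0 + image.shape[0], c0:c0 + image.shape[1], :]
    # The top surface of the dilated umbra is the grey-value dilation.
    return conv.shape[2] - 1 - np.argmax(conv[..., ::-1], axis=2)

img = np.random.randint(0, 256, (64, 64))
ball = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]])   # small non-flat structuring function
dilated = dilate_umbra(img, ball)
```

Erosion follows by duality: negate the image, dilate with the reflected structuring function, and negate the result.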
Citations: 1
Semi-supervised Grounding Alignment for Multi-modal Feature Learning
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00015
Shih-Han Chou, Zicong Fan, J. Little, L. Sigal
Self-supervised transformer-based architectures, such as ViLBERT [1] and others, have recently emerged as dominant paradigms for multi-modal feature learning. Such architectures leverage large-scale datasets (e.g., Conceptual Captions [2]) and, typically, image-sentence pairings for self-supervision. However, conventional multi-modal feature learning requires huge datasets and heavy computation for both pre-training and fine-tuning on the target task. In this paper, we illustrate that more granular semi-supervised alignment at the region-phrase level is an additional useful cue and can further improve the performance of such representations. To this end, we propose a novel semi-supervised grounding alignment loss, which leverages an off-the-shelf pre-trained phrase grounding model for pseudo-supervision (by producing region-phrase alignments). This semi-supervised formulation enables better feature learning in the absence of any additional human annotations on the large-scale (Conceptual Captions) dataset. Further, it shows an even larger margin of improvement on smaller data splits, leading to effective data-efficient feature learning. We illustrate the superiority of the learned features by fine-tuning the resulting models on multiple vision-language downstream tasks: visual question answering (VQA), visual commonsense reasoning (VCR), and visual grounding. Experiments on the VQA, VCR, and grounding benchmarks demonstrate improvements of up to 1.3% in accuracy (in visual grounding) with large-scale training, and up to 5.9% (in VQA) with 1/8 of the data for pre-training and fine-tuning. We will release the code and all pre-trained models upon acceptance.
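A hedged sketch of one plausible form of such a grounding-alignment objective: an off-the-shelf grounding model supplies, for each phrase, the index of its pseudo-aligned region, and the loss is a cross-entropy over phrase-to-region similarity logits. The exact formulation in the paper may differ; the feature dimensions and temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def grounding_alignment_loss(region_feats, phrase_feats, pseudo_region_idx, temperature=0.07):
    """Cross-entropy over phrase-to-region similarities, supervised by pseudo alignments (sketch)."""
    r = F.normalize(region_feats, dim=-1)          # (num_regions, D)
    p = F.normalize(phrase_feats, dim=-1)          # (num_phrases, D)
    logits = p @ r.t() / temperature               # (num_phrases, num_regions)
    return F.cross_entropy(logits, pseudo_region_idx)

# Pseudo-aligned region index for each phrase, as produced by an off-the-shelf grounding model.
loss = grounding_alignment_loss(torch.randn(36, 512), torch.randn(5, 512),
                                torch.tensor([3, 0, 17, 8, 21]))
```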
Citations: 3
Safe Landing Zones Detection for UAVs Using Deep Regression
Pub Date : 2022-05-01 DOI: 10.1109/CRV55824.2022.00035
Sakineh Abdollahzadeh, Pier-Luc Proulx, M. S. Allili, J. Lapointe
Finding safe landing zones (SLZ) in urban areas and natural scenes is one of the many challenges that must be overcome in automating Unmanned Aerial Vehicles (UAV) navigation. Using passive vision sensors to achieve this objective is a very promising avenue due to their low cost and the potential they provide for performing simultaneous terrain analysis and 3D reconstruction. In this paper, we propose using a deep learning approach on UAV imagery to assess the SLZ. The model is built on a semantic segmentation architecture whereby thematic classes of the terrain are mapped into safety scores for UAV landing. Contrary to past methods, which use hard classification into safe/unsafe landing zones, our approach provides a continuous safety map that is more practical for an emergency landing. Experiments on public datasets have shown promising results.
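A minimal sketch of the class-to-safety mapping this abstract describes: per-pixel class probabilities from the segmentation head are collapsed into a continuous safety map by weighting each terrain class with a safety score. The class list, scores, and tensor shapes below are illustrative assumptions.

```python
import torch

# Hypothetical terrain classes and their landing-safety scores (illustrative only).
SAFETY_SCORES = torch.tensor([0.9, 0.1, 0.0, 0.6])   # e.g. grass, trees, water, pavement

def safety_map(class_logits: torch.Tensor) -> torch.Tensor:
    """Collapse per-pixel class probabilities into a continuous safety map (sketch)."""
    probs = class_logits.softmax(dim=1)               # (N, num_classes, H, W)
    return torch.einsum("nchw,c->nhw", probs, SAFETY_SCORES)

smap = safety_map(torch.randn(1, 4, 128, 128))        # (1, 128, 128), values in [0, 1]
```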
Citations: 1