
Latest publications from the 2022 19th Conference on Robots and Vision (CRV)

Understanding the impact of image and input resolution on deep digital pathology patch classifiers
Pub Date : 2022-04-29 DOI: 10.1109/CRV55824.2022.00028
Eu Wern Teh, Graham W. Taylor
We consider annotation-efficient learning in Digital Pathology (DP), where expert annotations are expensive and thus scarce. We explore the impact of image and input resolution on DP patch classification performance. We use two cancer patch classification datasets, PCam and CRC, to validate the results of our study. Our experiments show that patch classification performance can be improved by manipulating both the image and input resolution in annotation-scarce and annotation-rich environments. We show a positive correlation between the image and input resolution and the patch classification accuracy on both datasets. By exploiting the image and input resolution, our final model trained on < 1% of the data performs as well as the model trained on 100% of the data at the original image resolution on the PCam dataset.
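The abstract distinguishes image resolution (how much tissue detail is captured per patch) from input resolution (the pixel size presented to the network). A minimal, hedged sketch of that knob is below; the ResNet-18 backbone, bilinear upsampling, and patch sizes are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch (not the authors' code): the difference between image resolution
# (pixels captured per patch) and input resolution (pixels fed to the network).
# A torchvision ResNet-18 stands in for the patch classifier.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=2)          # binary tumor / non-tumor patch classifier
model.eval()

patch = torch.rand(1, 3, 96, 96)         # a 96x96 patch, as stored in PCam

# Upsampling the patch before inference raises the *input* resolution even though
# the underlying *image* resolution (scanner magnification) stays fixed.
for input_size in (96, 192, 384):
    x = F.interpolate(patch, size=(input_size, input_size),
                      mode="bilinear", align_corners=False)
    with torch.no_grad():
        logits = model(x)
    print(input_size, logits.softmax(dim=-1).numpy())
```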
Citations: 0
A Simple Method to Boost Human Pose Estimation Accuracy by Correcting the Joint Regressor for the Human3.6m Dataset
Pub Date : 2022-04-29 DOI: 10.1109/CRV55824.2022.00009
Eric Hedlin, Helge Rhodin, K. M. Yi
Many human pose estimation methods estimate Skinned Multi-Person Linear (SMPL) models and regress the human joints from these SMPL estimates. In this work, we show that the most widely used SMPL-to-joint linear layer (joint regressor) is inaccurate, which may mislead pose evaluation results. To achieve a more accurate joint regressor, we propose a method to create pseudo-ground-truth SMPL poses, which can then be used to train an improved regressor. Specifically, we optimize SMPL estimates coming from a state-of-the-art method so that their projection matches the silhouettes of humans in the scene, as well as the ground-truth 2D joint locations. While the quality of this pseudo-ground-truth is challenging to assess due to the lack of actual ground-truth SMPL, with the Human3.6M dataset we qualitatively show that our joint locations are more accurate and that our regressor leads to improved pose estimation results on the test set without any need for retraining. We release our code and joint regressor at https://github.com/ubc-vision/joint-regressor-refinement
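Concretely, the joint regressor is a linear map from SMPL mesh vertices to 3D joints, so correcting it amounts to re-fitting a matrix. A minimal sketch of that relationship follows; the joint count and placeholder weights are assumptions, not the released regressor.

```python
# Minimal sketch (assumed shapes, not the released code): the SMPL-to-joint
# "joint regressor" is a linear layer mapping mesh vertices to joint locations.
import torch

N_VERTS, N_JOINTS = 6890, 17                 # SMPL vertex count; Human3.6M-style joint set

J = torch.zeros(N_JOINTS, N_VERTS)           # regressor weights (normally learned / provided)
J[:, :N_JOINTS] = torch.eye(N_JOINTS)        # placeholder entries just to make this runnable

vertices = torch.rand(N_VERTS, 3)            # one estimated SMPL mesh
joints = J @ vertices                        # (17, 3) regressed 3D joint locations

# Refining J against pseudo-ground-truth SMPL fits, as the paper proposes, amounts to
# re-fitting this linear map so that J @ vertices matches trusted joint annotations.
print(joints.shape)
```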
Citations: 3
CellDefectNet: A Machine-designed Attention Condenser Network for Electroluminescence-based Photovoltaic Cell Defect Inspection
Pub Date : 2022-04-25 DOI: 10.1109/CRV55824.2022.00036
Carol Xu, M. Famouri, Gautam Bathla, Saeejith Nair, M. Shafiee, Alexander Wong
Photovoltaic cells are electronic devices that convert light energy to electricity, forming the backbone of solar energy harvesting systems. An essential step in the manufacturing process for photovoltaic cells is visual quality inspection using electroluminescence imaging to identify defects such as cracks, finger interruptions, and broken cells. A big challenge faced by industry in photovoltaic cell visual inspection is that it is currently done manually by human inspectors, which is extremely time-consuming, laborious, and prone to human error. While deep learning approaches hold great potential for automating this inspection, the hardware resource-constrained manufacturing scenario makes it challenging to deploy complex deep neural network architectures. In this work, we introduce CellDefectNet, a highly efficient attention condenser network designed via machine-driven design exploration specifically for electroluminescence-based photovoltaic cell defect detection on the edge. We demonstrate the efficacy of CellDefectNet on a benchmark dataset comprising a diversity of photovoltaic cells captured using electroluminescence imagery, achieving an accuracy of $\sim 86.3\%$ while possessing just 410K parameters ($\sim 13\times$ lower than EfficientNet-B0) and $\sim 115\mathrm{M}$ FLOPs ($\sim 12\times$ lower than EfficientNet-B0), and running $\sim 13\times$ faster on an ARM Cortex A-72 embedded processor compared to EfficientNet-B0.
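The efficiency numbers in the abstract are all relative to EfficientNet-B0. A hedged sketch of how such a parameter comparison can be reproduced with torchvision is shown below; the three defect classes are an assumption for illustration, and CellDefectNet itself is not reimplemented here.

```python
# Minimal sketch (not CellDefectNet itself): counting parameters the way the abstract
# compares models, using torchvision's EfficientNet-B0 as the reference architecture.
import torch
from torchvision.models import efficientnet_b0

def count_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

reference = efficientnet_b0(num_classes=3)    # e.g. crack / finger interruption / broken cell
print(f"EfficientNet-B0 parameters: {count_params(reference) / 1e6:.1f}M")
# CellDefectNet is reported at ~410K parameters, i.e. roughly 13x fewer than this reference.
```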
Citations: 2
Improving tracking with a tracklet associator
Pub Date : 2022-04-22 DOI: 10.1109/CRV55824.2022.00030
Rémi Nahon, Guillaume-Alexandre Bilodeau, G. Pesant
Multiple object tracking (MOT) is a task in computer vision that aims to detect the position of various objects in videos and to associate them to a unique identity. We propose an approach based on Constraint Programming (CP) whose goal is to be grafted onto any existing tracker in order to improve its object association results. We developed a modular algorithm divided into three independent phases. The first phase consists of recovering the tracklets provided by a base tracker and cutting them at the places where uncertain associations are spotted, for example, when tracklets overlap, which may cause identity switches. In the second phase, we associate the previously constructed tracklets using a Belief Propagation Constraint Programming algorithm, where we propose various constraints that assign scores to each of the tracklets based on multiple characteristics, such as their dynamics or the distance between them in time and space. Finally, the third phase is a rudimentary interpolation model to fill in the remaining holes in the trajectories we built. Experiments show that our model leads to improvements in the results for all three of the state-of-the-art trackers on which we tested it (3 to 4 points gained on HOTA and IDF1).
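As rough intuition for the association step only (this is a greedy matcher, not the paper's Belief Propagation / Constraint Programming model), tracklet fragments can be scored by the time gap and spatial jump between one tracklet's end and another's start; all values below are made up.

```python
# Simplified sketch: score candidate tracklet pairs, then greedily link the best ones.
import numpy as np

# Each tracklet: (start_frame, end_frame, start_xy, end_xy); values are made up.
tracklets = [
    (0, 10, np.array([5.0, 5.0]), np.array([20.0, 8.0])),
    (12, 30, np.array([22.0, 9.0]), np.array([60.0, 15.0])),
    (11, 25, np.array([90.0, 40.0]), np.array([70.0, 42.0])),
]

def link_score(a, b):
    """Lower is better: penalize time gap and spatial jump from a's end to b's start."""
    dt = b[0] - a[1]
    if dt <= 0:                       # b must start after a ends
        return np.inf
    return dt + np.linalg.norm(b[2] - a[3])

pairs = sorted(
    ((link_score(a, b), i, j)
     for i, a in enumerate(tracklets)
     for j, b in enumerate(tracklets) if i != j),
    key=lambda t: t[0],
)
used_src, used_dst, links = set(), set(), []
for score, i, j in pairs:
    if np.isfinite(score) and i not in used_src and j not in used_dst:
        links.append((i, j, score))
        used_src.add(i)
        used_dst.add(j)
print(links)                          # expect tracklet 0 linked to tracklet 1
```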
Citations: 0
Adaptive Memory Management for Video Object Segmentation
Pub Date : 2022-04-13 DOI: 10.1109/CRV55824.2022.00018
Ali Pourganjalikhan, Charalambos (Charis) Poullis
Matching-based networks have achieved state-of-the-art performance for video object segmentation (VOS) tasks by storing every k-th frame in an external memory bank for future inference. Storing the intermediate frames' predictions provides the network with richer cues for segmenting an object in the current frame. However, the size of the memory bank gradually increases with the length of the video, which slows down inference and makes it impractical to handle arbitrary-length videos. This paper proposes an adaptive memory bank strategy for matching-based networks for semi-supervised video object segmentation (VOS) that can handle videos of arbitrary length by discarding obsolete features. Features are indexed based on their importance in the segmentation of the objects in previous frames. Based on the index, we discard unimportant features to accommodate new features. We present experiments on DAVIS 2016, DAVIS 2017, and YouTube-VOS that demonstrate that our method outperforms state-of-the-art methods that employ a first-and-latest strategy with fixed-size memory banks, and achieves comparable performance to the every-k strategy with increasing-size memory banks. Furthermore, experiments show that our method increases inference speed by up to 80% over the every-k strategy and 35% over the first-and-latest strategy.
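A minimal sketch of the fixed-capacity idea follows; the importance score, capacity, and feature size below are placeholders, not the paper's exact indexing rule.

```python
# Minimal sketch: a fixed-capacity memory bank that evicts the least-important stored
# feature instead of growing with video length.
import heapq
import torch

class AdaptiveMemoryBank:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []                      # min-heap of (importance, insertion_id, feature)
        self._next_id = 0

    def add(self, feature: torch.Tensor, importance: float) -> None:
        heapq.heappush(self._heap, (importance, self._next_id, feature))
        self._next_id += 1
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)        # drop the least useful feature

    def features(self) -> torch.Tensor:
        return torch.stack([f for _, _, f in self._heap])

bank = AdaptiveMemoryBank(capacity=4)
for t in range(10):                          # pretend we process 10 frames
    feat = torch.randn(256)                  # a frame's memory feature
    usage = float(torch.rand(()))            # stand-in for how often it matched past queries
    bank.add(feat, usage)
print(bank.features().shape)                 # always at most (4, 256)
```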
Citations: 0
Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers
Pub Date : 2022-03-07 DOI: 10.1109/CRV55824.2022.00033
Miguel A. Saavedra-Ruiz, Sacha Morin, L. Paull
In this work, we consider the problem of learning a perception model for monocular robot navigation using few annotated images. Using a Vision Transformer (ViT) pretrained with a label-free self-supervised method, we successfully train a coarse image segmentation model for the Duckietown environment using 70 training images. Our model performs coarse image segmentation at the $8\times 8$ patch level, and the inference resolution can be adjusted to balance prediction granularity and real-time perception constraints. We study how best to adapt a ViT to our task and environment, and find that some lightweight architectures can yield good single-image segmentations at a usable frame rate, even on CPU. The resulting perception model is used as the backbone for a simple yet robust visual servoing agent, which we deploy on a differential drive mobile robot to perform two tasks: lane following and obstacle avoidance.
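Since predictions are made at the $8\times 8$ patch level, the perception model reduces to a per-patch classifier on ViT tokens. A shapes-only sketch is below; the class names, image size, and feature dimension are assumptions rather than the authors' pipeline.

```python
# Minimal sketch (shapes only): a linear head turns frozen ViT patch tokens into a
# coarse segmentation grid at the 8x8-patch level.
import torch
import torch.nn as nn

B, D = 1, 384                      # batch size; ViT-S embedding dimension
H_patches, W_patches = 60, 80      # a 480x640 image split into 8x8 patches
patch_tokens = torch.randn(B, H_patches * W_patches, D)      # stand-in for frozen ViT features

head = nn.Linear(D, 3)             # 3 classes, e.g. drivable lane / obstacle / background
logits = head(patch_tokens)                                   # (B, N, 3)
seg = logits.argmax(-1).reshape(B, H_patches, W_patches)      # coarse per-patch labels
print(seg.shape)

# Lowering the inference resolution (fewer patches per side) trades prediction
# granularity for frame rate, which is the knob tuned for CPU deployment.
```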
Citations: 2
Attention based Occlusion Removal for Hybrid Telepresence Systems
Pub Date : 2021-12-02 DOI: 10.1109/CRV55824.2022.00029
Surabhi Gupta, Ashwath Shetty, Avinash Sharma
Traditionally, video conferencing is a widely adopted solution for remote communication, but it inherently lacks immersiveness due to the 2D nature of facial representation. The integration of Virtual Reality (VR) into a communication/telepresence system through Head Mounted Displays (HMDs) promises to provide users with a much better immersive experience. However, HMDs hinder this by blocking the facial appearance and expressions of the user. We propose a novel attention-enabled encoder-decoder architecture for HMD de-occlusion to overcome these issues. We also propose to train our person-specific model using short videos of the user captured in varying appearances, and demonstrate generalization to unseen poses and appearances of the user. We report superior qualitative and quantitative results over state-of-the-art methods. We also present applications of this approach to hybrid video teleconferencing using existing animation and 3D face reconstruction pipelines. The dataset is available at this website.
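A generic stand-in for the de-occlusion idea is sketched below: an encoder-decoder takes the occluded face plus an HMD mask and reconstructs the hidden region. This is not the paper's architecture; the layer sizes and the channel-attention gate are assumptions.

```python
# Generic stand-in (not the paper's architecture): encoder-decoder inpainting of the
# HMD-occluded face region, with a simple channel-attention gate on the bottleneck.
import torch
import torch.nn as nn

class TinyDeoccluder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.ReLU(),   # RGB + occlusion mask
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.Sequential(                                  # channel attention gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(64, 64, 1), nn.Sigmoid()
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, occluded_rgb, mask):
        x = torch.cat([occluded_rgb, mask], dim=1)
        z = self.enc(x)
        z = z * self.attn(z)            # re-weight bottleneck features before decoding
        return self.dec(z)              # reconstructed, de-occluded face

model = TinyDeoccluder()
out = model(torch.rand(1, 3, 128, 128), torch.ones(1, 1, 128, 128))
print(out.shape)                        # (1, 3, 128, 128)
```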
Citations: 2
M2A: Motion Aware Attention for Accurate Video Action Recognition
Pub Date : 2021-11-18 DOI: 10.1109/CRV55824.2022.00019
Brennan Gebotys, Alexander Wong, David A Clausi
Advancements in attention mechanisms have led to significant performance improvements in a variety of areas in machine learning due to their ability to enable the dynamic modeling of temporal sequences. A particular area in computer vision that is likely to benefit greatly from the incorporation of attention mechanisms is video action recognition. However, much of the current research on attention mechanisms has focused on spatial and temporal attention, which are unable to take advantage of the inherent motion found in videos. Motivated by this, we develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteristics. More specifically, M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos. The proposed M2A mechanism is simple to implement and can be easily incorporated into any neural network backbone architecture. We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a $+15\%$ to $+26\%$ improvement in top-1 accuracy across different backbone architectures, with only a small increase in computational complexity. We further compared the performance of M2A with other state-of-the-art motion and attention mechanisms on the Something-Something V1 video action recognition benchmark. Experimental results showed that M2A can lead to further improvements when combined with other temporal mechanisms and that it outperforms other motion-only or attention-only mechanisms by as much as $+60\%$ in top-1 accuracy for specific classes in the benchmark. We make our code available at: https://github.com/gebob19/M2A.
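A simplified sketch of attending with motion cues is given below; frame differences stand in for the extracted motion signal, and the dimensions are assumptions. This is not the released M2A module.

```python
# Simplified sketch: frame differences supply a motion signal, and temporal
# self-attention re-weights frame features based on those motion patterns.
import torch
import torch.nn as nn

B, T, D = 2, 8, 256                         # batch, frames, per-frame feature dim
frame_feats = torch.randn(B, T, D)          # backbone features for each frame

# Motion information between consecutive frames (pad so sequence length is preserved).
motion = frame_feats[:, 1:] - frame_feats[:, :-1]
motion = torch.cat([torch.zeros(B, 1, D), motion], dim=1)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
out, weights = attn(query=motion, key=motion, value=frame_feats)   # attend using motion cues
print(out.shape, weights.shape)             # (2, 8, 256), (2, 8, 8)
```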
Citations: 2
Temporal Convolutions for Multi-Step Quadrotor Motion Prediction
Pub Date : 2021-10-08 DOI: 10.1109/CRV55824.2022.00013
Sam Looper, Steven L. Waslander
Model-based control methods for robotic systems such as quadrotors, autonomous driving vehicles and flexible manipulators require motion models that generate accurate predictions of complex nonlinear system dynamics over long periods of time. Temporal Convolutional Networks (TCNs) can be adapted to this challenge by formulating multi-step prediction as a sequence-to-sequence modeling problem. We present End2End-TCN: a fully convolutional architecture that integrates future control inputs to compute multi-step motion predictions in one forward pass. We demonstrate the approach with a thorough analysis of TCN performance for the quadrotor modeling task, which includes an investigation of scaling effects and ablation studies. Ultimately, End2End-TCN provides a 55% error reduction over the state of the art in multi-step prediction on an aggressive indoor quadrotor flight dataset. The model yields accurate predictions across 90-timestep horizons over a 900 ms interval.
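A minimal sketch of the sequence-to-sequence formulation follows: dilated 1D convolutions consume a window of states concatenated with future control inputs and emit the whole predicted trajectory in one forward pass. The state and control dimensions, channel widths, and dilations are assumptions, not End2End-TCN itself.

```python
# Minimal sketch: a tiny TCN mapping (states, future controls) to a multi-step prediction.
import torch
import torch.nn as nn

STATE_DIM, CTRL_DIM, HORIZON = 9, 4, 90      # e.g. quadrotor state, rotor commands, 90 steps

class TinyTCN(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        layers, in_ch = [], STATE_DIM + CTRL_DIM
        for dilation in (1, 2, 4, 8):        # growing dilation widens the temporal receptive field
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU()]
            in_ch = channels
        self.net = nn.Sequential(*layers)
        self.head = nn.Conv1d(channels, STATE_DIM, kernel_size=1)

    def forward(self, states, controls):
        x = torch.cat([states, controls], dim=1)     # (B, state+ctrl, T)
        return self.head(self.net(x))                # (B, state, T) predicted trajectory

model = TinyTCN()
states = torch.randn(1, STATE_DIM, HORIZON)          # known state history (padded in practice)
controls = torch.randn(1, CTRL_DIM, HORIZON)         # future control inputs over the horizon
print(model(states, controls).shape)                 # (1, 9, 90)
```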
Citations: 1
ROS-X-Habitat: Bridging the ROS Ecosystem with Embodied AI
Pub Date : 2021-09-16 DOI: 10.1109/CRV55824.2022.00012
Guanxiong Chen, Haoyu Yang, Ian M. Mitchell
We introduce ROS-X-Habitat, a software interface that bridges the AI Habitat platform for embodied learning-based agents with other robotics resources via ROS. This interface not only offers standardized communication protocols between embodied agents and simulators, but also enables physically realistic and photorealistic simulation that benefits the training and/or testing of vision-based embodied agents. With this interface, roboticists can evaluate their own Habitat RL agents in another ROS-based simulator or use Habitat Sim v2 as the test bed for their own robotic algorithms. Through in silico experiments, we demonstrate that ROS-X-Habitat has minimal impact on the navigation performance and simulation speed of a Habitat RGBD agent; that a standard set of ROS mapping, planning and navigation tools can run in Habitat Sim v2; and that a Habitat agent can run in the standard ROS simulator Gazebo.
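The bridging idea can be sketched as a small node that republishes simulator observations as standard ROS messages. This is not the ROS-X-Habitat code: it assumes a ROS 1 (rospy) environment, the topic name is an assumption, and `render_rgb_observation` is a hypothetical placeholder for a Habitat Sim render call.

```python
# Minimal sketch of the bridging idea: wrap simulator observations in standard ROS
# messages so ordinary ROS nodes can consume them.
import numpy as np
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

def render_rgb_observation() -> np.ndarray:
    """Placeholder: return one (H, W, 3) uint8 RGB frame from the simulator."""
    return np.zeros((480, 640, 3), dtype=np.uint8)

def main():
    rospy.init_node("habitat_bridge")
    pub = rospy.Publisher("/camera/rgb/image_raw", Image, queue_size=1)
    bridge = CvBridge()
    rate = rospy.Rate(30)                      # publish at 30 Hz
    while not rospy.is_shutdown():
        frame = render_rgb_observation()
        pub.publish(bridge.cv2_to_imgmsg(frame, encoding="rgb8"))
        rate.sleep()

if __name__ == "__main__":
    main()
```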
Citations: 3