
Latest publications from the 2021 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW)

Neural vision-based semantic 3D world modeling
Pub Date : 2021-01-01 DOI: 10.1109/WACVW52041.2021.00024
Sotirios Papadopoulos, Ioannis Mademlis, I. Pitas
Scene geometry estimation and semantic segmentation using image/video data are two active machine learning/computer vision research topics. Given monocular or stereoscopic 3D images, depicted scene/object geometry in the form of depth maps can be successfully estimated, while modern Deep Neural Network (DNN) architectures can accurately predict semantic masks on an image. In several scenarios, both tasks are required at once, leading to a need for combined semantic 3D world mapping methods. In the wake of modern autonomous systems, DNNs that simultaneously handle both tasks have arisen, exploiting machine/deep learning to save considerably on computational resources and enhance performance, as these tasks can mutually benefit from each other. A great application area is 3D road scene modeling and semantic segmentation, e.g., for an autonomous car to identify and localize in 3D space visible pavement regions (marked as “road”) that are essential for autonomous driving. Due to the significance of this field, this paper surveys the state-of-the-art DNN-based methods for scene geometry estimation, image semantic segmentation and joint inference of both.
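The joint inference the survey covers is typically realized as a single shared encoder feeding two task-specific decoder heads, one regressing a per-pixel depth map and one predicting per-pixel class logits, so both tasks reuse the same features. The PyTorch sketch below is a minimal illustrative example of that pattern only, not any specific surveyed architecture; the layer sizes and the 19-class output are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class JointDepthSegNet(nn.Module):
    """Minimal shared-encoder, two-head network for joint depth estimation
    and semantic segmentation (illustrative sketch, not a surveyed model)."""
    def __init__(self, num_classes=19):
        super().__init__()
        # Shared convolutional encoder (downsamples by 4).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Depth head: regresses a single-channel depth map.
        self.depth_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )
        # Segmentation head: per-pixel class logits.
        self.seg_head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        feats = self.encoder(x)          # features shared by both tasks
        return self.depth_head(feats), self.seg_head(feats)

# Usage: one forward pass yields both outputs, so the tasks share computation.
net = JointDepthSegNet()
depth, seg_logits = net(torch.randn(1, 3, 256, 256))  # (1,1,256,256), (1,19,256,256)
```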
Citations: 6
Automatic Virtual 3D City Generation for Synthetic Data Collection
Pub Date : 2021-01-01 DOI: 10.1109/WACVW52041.2021.00022
Bingyu Shen, Boyang Li, W. Scheirer
Computer vision has achieved superior results with the rapid development of new techniques in deep neural networks. Object detection in the wild is a core task in computer vision and already has many successful applications in the real world. However, deep neural networks for object detection usually consist of hundreds, and sometimes even thousands, of layers. Training such networks is challenging, and training data has a fundamental impact on model performance. Because data collection and annotation are expensive and labor-intensive, many data augmentation methods have been proposed to generate synthetic data for neural network training. Most of those methods focus on manipulating 2D images. In contrast, in this paper we leverage the realistic visual effects of 3D environments and propose a new way of generating synthetic data for computer vision tasks related to city scenes. Specifically, we describe a pipeline that can generate a 3D city model from an input 2D image that portrays the layout design of a city. This pipeline also takes optional parameters to further customize the output 3D city model. Using our pipeline, a virtual 3D city model with high-quality textures can be generated within seconds, and the output is an object ready to render. The generated model will assist people with limited 3D development knowledge in creating high-quality city scenes for different needs. As examples, we show the use of generated 3D city models as the synthetic data source for a scene text detection task and a traffic sign detection task. Both qualitative and quantitative results show that the generated virtual city is a good match to real-world data and can potentially benefit other computer vision tasks with similar contexts.
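A pipeline of this kind generally begins by parsing the input 2D layout image into labelled regions before instantiating 3D geometry for each region. The snippet below sketches only that parsing step, under the assumption of a color-coded layout raster; the color-to-class mapping is invented for illustration and is not the authors' input format.

```python
import numpy as np

# Hypothetical color coding of the 2D layout image (not the paper's actual format).
LAYOUT_CLASSES = {
    (128, 128, 128): "road",
    (200, 50, 50): "residential_block",
    (50, 50, 200): "commercial_block",
    (50, 200, 50): "park",
}

def parse_layout(layout_rgb: np.ndarray):
    """Map each pixel of an (H, W, 3) layout raster to a semantic class id,
    producing a grid a downstream generator could turn into 3D city geometry."""
    h, w, _ = layout_rgb.shape
    class_map = np.full((h, w), -1, dtype=np.int32)
    for idx, color in enumerate(LAYOUT_CLASSES):
        mask = np.all(layout_rgb == np.array(color, dtype=layout_rgb.dtype), axis=-1)
        class_map[mask] = idx
    return class_map

# Usage: a toy 4x4 layout made entirely of "road" pixels.
toy = np.tile(np.array([128, 128, 128], dtype=np.uint8), (4, 4, 1))
print(parse_layout(toy))  # all zeros -> every cell is road
```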
Citations: 4
Focused LRP: Explainable AI for Face Morphing Attack Detection
Pub Date : 2021-01-01 DOI: 10.1109/WACVW52041.2021.00014
Clemens Seibold, A. Hilsmann, P. Eisert
The task of detecting morphed face images has become highly relevant in recent years to ensure the security of automatic verification systems based on facial images, e.g. automated border control gates. Detection methods based on Deep Neural Networks (DNN) have been shown to be very suitable to this end. However, they do not provide transparency in the decision making, and it is not clear how they distinguish between genuine and morphed face images. This is particularly relevant for systems intended to assist a human operator, who should be able to understand the reasoning. In this paper, we tackle this problem and present Focused Layer-wise Relevance Propagation (FLRP). This framework explains to a human inspector, at a precise pixel level, which image regions a Deep Neural Network uses to distinguish between a genuine and a morphed face image. Additionally, we propose another framework to objectively analyze the quality of our method and compare FLRP to other DNN interpretability methods. This evaluation framework is based on removing detected artifacts and analyzing the influence of these changes on the decision of the DNN. In particular, when the DNN is uncertain in its decision or even incorrect, FLRP performs much better than other methods in highlighting visible artifacts.
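FLRP builds on Layer-wise Relevance Propagation, which redistributes the network's output score backwards through the layers, splitting the relevance arriving at each unit among its inputs in proportion to their contributions a_j·w_jk, with a small ε stabilizing the denominator. The NumPy sketch below implements that generic epsilon rule for one dense layer only, not the focused variant proposed in the paper.

```python
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    """Backpropagate relevance R_out through one dense layer z = a @ W + b
    using the LRP-epsilon rule (generic rule, not the paper's focused variant)."""
    z = a @ W + b                      # pre-activations of the upper layer
    z = z + eps * np.sign(z)           # epsilon stabilizer avoids division by zero
    s = R_out / z                      # relevance per unit of pre-activation
    c = s @ W.T                        # redistribute back along the weights
    return a * c                       # relevance of the lower-layer activations

# Usage: relevance is (approximately) conserved across the layer.
rng = np.random.default_rng(0)
a, W, b = rng.random(4), rng.random((4, 3)), np.zeros(3)
R_out = rng.random(3)
R_in = lrp_epsilon(a, W, b, R_out)
print(R_in.sum(), R_out.sum())  # close to each other when eps is small
```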
Citations: 10
Context-Aware Personality Inference in Dyadic Scenarios: Introducing the UDIVA Dataset
Pub Date : 2020-12-28 DOI: 10.1109/WACVW52041.2021.00005
Cristina Palmero, Javier Selva, Sorina Smeureanu, Julio C. S. Jacques Junior, Albert Clapés, Alexa Moseguí, Zejian Zhang, D. Gallardo-Pujol, G. Guilera, D. Leiva, Sergio Escalera
This paper introduces UDIVA, a new non-acted dataset of face-to-face dyadic interactions, where interlocutors perform competitive and collaborative tasks with different behavior elicitation and cognitive workload. The dataset consists of 90.5 hours of dyadic interactions among 147 participants distributed in 188 sessions, recorded using multiple audiovisual and physiological sensors. Currently, it includes sociodemographic, self- and peer-reported personality, internal state, and relationship profiling from participants. As an initial analysis on UDIVA, we propose a transformer-based method for self-reported personality inference in dyadic scenarios, which uses audiovisual data and different sources of context from both interlocutors to regress a target person’s personality traits. Preliminary results from an incremental study show consistent improvements when using all available context information.
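The personality-inference baseline is described as a transformer that ingests audiovisual features and context from both interlocutors and regresses the target person's trait scores. The sketch below is a heavily simplified stand-in built from a generic torch TransformerEncoder; the feature dimension, the concatenation-based context fusion and the five-trait (OCEAN) output are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DyadicTraitRegressor(nn.Module):
    """Toy transformer that pools per-frame features of a target person and an
    interlocutor and regresses five personality trait scores (OCEAN)."""
    def __init__(self, feat_dim=128, num_traits=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_traits)

    def forward(self, target_feats, context_feats):
        # Concatenate the two streams along time so attention can mix them.
        tokens = torch.cat([target_feats, context_feats], dim=1)  # (B, T1+T2, D)
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))  # average-pool, then regress traits

# Usage with random per-frame features (e.g., fused audiovisual embeddings).
model = DyadicTraitRegressor()
traits = model(torch.randn(2, 16, 128), torch.randn(2, 16, 128))  # (2, 5)
```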
Citations: 36
ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on
Pub Date : 2020-12-18 DOI: 10.1109/WACVW52041.2021.00025
Gaurav Kuppa, Andrew Jong, Vera Liu, Ziwei Liu, Teng-Sheng Moh
Virtual try-on has garnered interest as a neural rendering benchmark task to evaluate complex object transfer and scene composition. Recent works in virtual clothing try-on feature a plethora of possible architectural and data representation choices. However, they present little clarity on quantifying the isolated visual effect of each choice, nor do they specify the hyperparameter details that are key to experimental reproduction. Our work, ShineOn, approaches the try-on task from a bottom-up approach and aims to shine light on the visual and quantitative effects of each experiment. We build a series of scientific experiments to isolate effective design choices in video synthesis for virtual clothing try-on. Specifically, we investigate the effect of different pose annotations, self-attention layer placement, and activation functions on the quantitative and qualitative performance of video virtual try-on. We find that Dense-Pose annotations not only enhance face details but also decrease memory usage and training time. Next, we find that attention layers improve face and neck quality. Finally, we show that GELU and ReLU activation functions are the most effective in our experiments despite the appeal of newer activations such as Swish and Sine. We will release a well-organized code base, hyperparameters, and model checkpoints to support the reproducibility of our results. We expect our extensive experiments and code to greatly inform future design choices in video virtual try-on. Our code may be accessed at https://github.com/andrewjong/ShineOn-Virtual-Tryon.
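One of the ablations compares activation functions (ReLU, GELU, Swish, Sine) inside otherwise identical blocks. The snippet below shows how such a swap is typically wired up in PyTorch; the convolutional block is a placeholder rather than ShineOn's actual network, and since sine is not a built-in torch activation it is defined by hand.

```python
import torch
import torch.nn as nn

class Sine(nn.Module):
    """sin(x) activation; not built into torch, defined here for the comparison."""
    def forward(self, x):
        return torch.sin(x)

# Swappable activations, as in an ablation over otherwise identical blocks.
ACTIVATIONS = {
    "relu": nn.ReLU,
    "gelu": nn.GELU,
    "swish": nn.SiLU,   # SiLU is the torch name for Swish
    "sine": Sine,
}

def conv_block(in_ch, out_ch, act="gelu"):
    """Placeholder conv block whose only varying design choice is the activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        ACTIVATIONS[act](),
    )

# Usage: build the same block with different activations for an ablation run.
x = torch.randn(1, 3, 64, 64)
for name in ACTIVATIONS:
    y = conv_block(3, 16, act=name)(x)
    print(name, tuple(y.shape))  # (1, 16, 64, 64) each time
```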
Citations: 9
A Log-likelihood Regularized KL Divergence for Video Prediction With a 3D Convolutional Variational Recurrent Network
Pub Date : 2020-12-11 DOI: 10.1109/WACVW52041.2021.00027
Haziq Razali, Basura Fernando
The use of latent variable models has been shown to be a powerful tool for modeling probability distributions over sequences. In this paper, we introduce a new variational model that extends the recurrent network in two ways for the task of video frame prediction. First, we introduce 3D convolutions inside all modules, including the recurrent model for future frame prediction, inputting and outputting a sequence of video frames at each timestep. This enables us to better exploit spatiotemporal information inside the variational recurrent model, allowing us to generate high-quality predictions. Second, we enhance the latent loss of the variational model by introducing a maximum likelihood estimate in addition to the KL divergence that is commonly used in variational models. This simple extension acts as a stronger regularizer in the variational autoencoder loss function and lets us obtain better results and generalizability. Experiments show that our model outperforms existing video prediction methods on several benchmarks while requiring fewer parameters.
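The core change is adding a maximum-likelihood term to the usual KL latent loss of the variational recurrent model. Assuming a diagonal Gaussian posterior and learned prior, a per-timestep loss of that general shape could look like the sketch below; the reconstruction term, weighting and exact distributions are assumptions for illustration, not the authors' published formulation.

```python
import torch
import torch.nn.functional as F

def vrnn_step_loss(x, x_hat, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Per-timestep loss for a variational recurrent predictor: reconstruction
    + KL(posterior || learned prior) + a Gaussian log-likelihood regularizer on
    the latent sample (illustrative shape only, not the paper's exact loss)."""
    recon = F.mse_loss(x_hat, x, reduction="mean")

    # KL between diagonal Gaussians q = N(mu_q, var_q) and p = N(mu_p, var_p).
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1.0).mean()

    # Extra term: negative log-likelihood (up to a constant) of the sampled
    # latent under the prior, acting as the additional regularizer.
    z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
    nll = 0.5 * (((z - mu_p) ** 2) / logvar_p.exp() + logvar_p).mean()

    return recon + beta * kl + nll

# Usage with dummy tensors standing in for a frame and its latent statistics.
x = torch.randn(2, 3, 64, 64)
stats = [torch.randn(2, 32) for _ in range(4)]   # mu_q, logvar_q, mu_p, logvar_p
loss = vrnn_step_loss(x, torch.randn_like(x), *stats)
print(float(loss))
```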
Citations: 5
PeR-ViS: Person Retrieval in Video Surveillance using Semantic Description
Pub Date : 2020-12-04 DOI: 10.1109/WACVW52041.2021.00009
Parshwa Shah, Arpit Garg, Vandit Gajjar
A person is usually characterized by descriptors like age, gender, height, cloth type, pattern, color, etc. Such descriptors are known as attributes and/or soft-biometrics. They bridge the semantic gap between a person’s description and retrieval in video surveillance. Retrieving a specific person from a semantic-description query has an important application in video surveillance. Using computer vision to fully automate the person retrieval task has been gathering interest within the research community. However, the current trend mainly focuses on retrieving persons with image-based queries, which have major limitations for practical usage. Instead of using an image query, in this paper we study the problem of person retrieval in video surveillance with a semantic description. To solve this problem, we develop a deep learning-based cascade filtering approach (PeR-ViS), which uses Mask R-CNN [14] (person detection and instance segmentation) and DenseNet-161 [16] (soft-biometric classification). On the standard person retrieval dataset of SoftBioSearch [6], we achieve 0.566 Average IoU and 0.792 %w IoU > 0.4, surpassing the current state-of-the-art by a large margin. We hope our simple, reproducible, and effective approach will help ease future research in the domain of person retrieval in video surveillance. The source code, baseline, and pre-trained weights are available at https://parshwa1999.github.io/PeR-ViS/.
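The cascade described (detect and segment persons first, then classify soft-biometric attributes on each crop) can be assembled from standard torchvision models. The sketch below follows that general idea with off-the-shelf pretrained weights and a made-up four-attribute head; it is not the released PeR-ViS code, and the attribute head and score threshold are assumptions.

```python
import torch
import torchvision

# Stage 1: person detection / instance segmentation (COCO-pretrained Mask R-CNN,
# torchvision >= 0.13 weights API).
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Stage 2: soft-biometric classifier; the 4-attribute head is a placeholder.
classifier = torchvision.models.densenet161(weights="DEFAULT")
classifier.classifier = torch.nn.Linear(classifier.classifier.in_features, 4)
classifier.eval()

@torch.no_grad()
def retrieve_persons(image, score_thresh=0.8):
    """Return soft-biometric logits for every confidently detected person."""
    out = detector([image])[0]                       # boxes, labels, scores, masks
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)  # COCO class 1 = person
    results = []
    for box in out["boxes"][keep]:
        x1, y1, x2, y2 = box.int().tolist()
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)   # (1, 3, h, w) person crop
        crop = torch.nn.functional.interpolate(crop, size=(224, 224))
        results.append(classifier(crop))             # hypothetical attribute logits
    return results

# Usage with a random image tensor in [0, 1]; real usage would load a video frame.
print(len(retrieve_persons(torch.rand(3, 480, 640))))
```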
Citations: 4
Pose-based Sign Language Recognition using GCN and BERT
Pub Date : 2020-12-01 DOI: 10.1109/WACVW52041.2021.00008
Anirudh Tunga, Sai Vidyaranya Nuthalapati, J. Wachs
Sign language recognition (SLR) plays a crucial role in bridging the communication gap between the hearing and vocally impaired community and the rest of society. Word-level sign language recognition (WSLR) is the first important step towards understanding and interpreting sign language. However, recognizing signs from videos is a challenging task, as the meaning of a word depends on a combination of subtle body motions, hand configurations and other movements. Recent pose-based architectures for WSLR either model the spatial and temporal dependencies among the poses in different frames simultaneously, or model only the temporal information without fully utilizing the spatial information. We tackle the problem of WSLR using a novel pose-based approach, which captures spatial and temporal information separately and performs late fusion. Our proposed architecture explicitly captures the spatial interactions in the video using a Graph Convolutional Network (GCN). The temporal dependencies between the frames are captured using Bidirectional Encoder Representations from Transformers (BERT). Experimental results on WLASL, a standard word-level sign language recognition dataset, show that our model significantly outperforms state-of-the-art pose-based methods, improving prediction accuracy by up to 5%.
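The two-stream split, a GCN over the joints of each frame for spatial structure and a BERT-style encoder over frames for temporal structure with late fusion, can be sketched compactly. The toy model below uses a single hand-rolled graph convolution and a generic torch TransformerEncoder as a BERT stand-in; the joint count, dimensions, identity adjacency and averaging fusion are all illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class PoseGCNBert(nn.Module):
    """Toy two-stream WSLR model: per-frame graph convolution over joints plus a
    transformer over time, fused late by averaging the two class logits."""
    def __init__(self, num_joints=27, feat_dim=64, num_classes=100):
        super().__init__()
        # Adjacency over the skeleton graph (identity = self-loops only;
        # a real model would use the actual joint connectivity).
        self.register_buffer("adj", torch.eye(num_joints))
        self.gcn_w = nn.Linear(2, feat_dim)            # joints carry (x, y) coordinates
        self.spatial_head = nn.Linear(num_joints * feat_dim, num_classes)

        layer = nn.TransformerEncoderLayer(d_model=num_joints * 2, nhead=2, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.temporal_head = nn.Linear(num_joints * 2, num_classes)

    def forward(self, poses):                          # poses: (B, T, J, 2)
        # Spatial stream: one GCN layer per frame, then pool over time.
        h = torch.relu(self.adj @ self.gcn_w(poses))   # (B, T, J, feat_dim)
        spatial_logits = self.spatial_head(h.flatten(2)).mean(dim=1)
        # Temporal stream: frames as tokens of flattened joint coordinates.
        z = self.temporal(poses.flatten(2))            # (B, T, J*2)
        temporal_logits = self.temporal_head(z.mean(dim=1))
        return 0.5 * (spatial_logits + temporal_logits)  # late fusion

# Usage: an 8-frame clip of 27 2D joints, 100 gloss classes.
logits = PoseGCNBert()(torch.randn(2, 8, 27, 2))       # (2, 100)
```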
Citations: 35
Symbolic AI for XAI: Evaluating LFIT Inductive Programming for Fair and Explainable Automatic Recruitment
Pub Date : 2020-12-01 DOI: 10.1109/WACVW52041.2021.00013
A. Ortega, Julian Fierrez, A. Morales, Zilong Wang, Tony Ribeiro
Machine learning methods are growing in relevance for biometrics and personal information processing in domains such as forensics, e-health, recruitment, and e-learning. In these domains, white-box (human-readable) explanations of systems built on machine learning methods can become crucial. Inductive Logic Programming (ILP) is a subfield of symbolic AI aimed at automatically learning declarative theories about the processes underlying data. Learning from Interpretation Transition (LFIT) is an ILP technique that can learn a propositional logic theory equivalent to a given black-box system (under certain conditions). The present work takes a first step toward a general methodology for incorporating accurate declarative explanations into classic machine learning by checking the viability of LFIT in a specific AI application scenario: fair recruitment based on an automatic tool, generated with machine learning methods, for ranking Curricula Vitae that incorporates soft biometric information (gender and ethnicity). We show the expressiveness of LFIT for this specific problem and propose a scheme that can be applicable to other domains.
Citations: 11
Person Perception Biases Exposed: Revisiting the First Impressions Dataset
Pub Date : 2020-11-30 DOI: 10.1109/WACVW52041.2021.00006
Julio C. S. Jacques Junior, Àgata Lapedriza, Cristina Palmero, Xavier Baró, Sergio Escalera
This work revisits the ChaLearn First Impressions database, annotated for personality perception using pairwise comparisons via crowdsourcing. We analyse for the first time the original pairwise annotations, and reveal existing person perception biases associated with perceived attributes like gender, ethnicity, age and face attractiveness. We show how person perception bias can influence data labelling of a subjective task, an issue that has so far received little attention from the computer vision and machine learning communities. We further show that the mechanism used to convert pairwise annotations to continuous values may magnify the biases if no special treatment is considered. The findings of this study are relevant for the computer vision community, which is still creating new datasets on subjective tasks and using them for practical applications while ignoring these perceptual biases.
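The magnification effect concerns the step that converts pairwise "who appears more X" votes into one continuous score per subject. The paper's exact mechanism is not restated in this abstract, so the sketch below uses a plain Bradley-Terry maximum-likelihood fit as a representative example of such a conversion.

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.
    wins[i, j] = number of times item i was preferred over item j."""
    n = wins.shape[0]
    p = np.ones(n)
    totals = wins + wins.T                       # comparisons per pair
    for _ in range(n_iter):
        for i in range(n):
            denom = np.sum(totals[i] / (p[i] + p), where=totals[i] > 0)
            p[i] = wins[i].sum() / denom if denom > 0 else p[i]
        p /= p.sum()                             # fix the overall scale
    return p

# Usage: 3 items, item 0 preferred most often in the toy votes.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
print(bradley_terry(wins))                       # strengths, highest for item 0
```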
Citations: 5