Probabilistic Joint Face-Skull Modelling for Facial Reconstruction
Dennis Madsen, M. Lüthi, Andreas Schneider, T. Vetter
We present a novel method for co-registration of two independent statistical shape models. We solve the problem of aligning a face model to a skull model with stochastic optimization based on Markov Chain Monte Carlo (MCMC). We create a probabilistic joint face-skull model and show how to obtain a distribution of plausible face shapes given a skull shape. Due to environmental and genetic factors, there exists a distribution of possible face shapes arising from the same skull. We therefore pose facial reconstruction as a conditional distribution of plausible face shapes given a skull shape. Because it is very difficult to obtain this distribution directly from MRI or CT data, we create a dataset of artificial face-skull pairs. To do this, we propose to combine three data sources of independent origin to model the joint face-skull distribution: a face shape model, a skull shape model and tissue depth marker information. For a given skull, we compute the posterior distribution of faces matching the tissue depth distribution using Metropolis-Hastings sampling, and we estimate the joint face-skull distribution from samples of this posterior. To find faces matching an unknown skull, we estimate the probability of each face under the joint face-skull model. To our knowledge, we are the first to provide a whole distribution of plausible faces arising from a skull instead of only a single reconstruction. We show how the face-skull model can be used to rank a face dataset and, on average, successfully identify the correct match within the top 30%. The face ranking even works when the face shapes are obtained from 2D images. We furthermore show how the face-skull model can be used to estimate the skull position in an MR image.
{"title":"Probabilistic Joint Face-Skull Modelling for Facial Reconstruction","authors":"Dennis Madsen, M. Lüthi, Andreas Schneider, T. Vetter","doi":"10.1109/CVPR.2018.00555","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00555","url":null,"abstract":"We present a novel method for co-registration of two independent statistical shape models. We solve the problem of aligning a face model to a skull model with stochastic optimization based on Markov Chain Monte Carlo (MCMC). We create a probabilistic joint face-skull model and show how to obtain a distribution of plausible face shapes given a skull shape. Due to environmental and genetic factors, there exists a distribution of possible face shapes arising from the same skull. We pose facial reconstruction as a conditional distribution of plausible face shapes given a skull shape. Because it is very difficult to obtain the distribution directly from MRI or CT data, we create a dataset of artificial face-skull pairs. To do this, we propose to combine three data sources of independent origin to model the joint face-skull distribution: a face shape model, a skull shape model and tissue depth marker information. For a given skull, we compute the posterior distribution of faces matching the tissue depth distribution with Metropolis-Hastings. We estimate the joint face-skull distribution from samples of the posterior. To find faces matching to an unknown skull, we estimate the probability of the face under the joint face-skull model. To our knowledge, we are the first to provide a whole distribution of plausible faces arising from a skull instead of only a single reconstruction. We show how the face-skull model can be used to rank a face dataset and on average successfully identify the correct match in top 30%. The face ranking even works when obtaining the face shapes from 2D images. We furthermore show how the face-skull model can be useful to estimate the skull position in an MR-image.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"1 1","pages":"5295-5303"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73181318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interpretable Video Captioning via Trajectory Structured Localization
X. Wu, Guanbin Li, Qingxing Cao, Qingge Ji, Liang Lin
Automatically describing open-domain videos with natural language is attracting increasing interest in the field of artificial intelligence. Most existing methods simply borrow ideas from image captioning and obtain a compact video representation from an ensemble of global image features before feeding it to an RNN decoder that outputs a sentence of variable length. However, not only is it arduous for the generator to focus on specific salient objects at different times given the global video representation, it is even more formidable to capture the fine-grained motion information and the relations between moving instances needed for more subtle linguistic descriptions. In this paper, we propose a Trajectory Structured Attentional Encoder-Decoder (TSA-ED) neural network framework for more elaborate video captioning, which works by integrating local spatial-temporal representations at the trajectory level through a structured attention mechanism. Our proposed method builds on an LSTM-based encoder-decoder framework and incorporates an attention modeling scheme to adaptively learn the correlation between sentence structure and the moving objects in videos, consequently generating more accurate and detailed descriptions in the decoding stage. Experimental results demonstrate that the trajectory-cluster-based feature representation and structured attention mechanism efficiently capture the local motion information in the video, helping to generate more fine-grained video descriptions, and achieve state-of-the-art performance on the well-known Charades and MSVD datasets.
{"title":"Interpretable Video Captioning via Trajectory Structured Localization","authors":"X. Wu, Guanbin Li, Qingxing Cao, Qingge Ji, Liang Lin","doi":"10.1109/CVPR.2018.00714","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00714","url":null,"abstract":"Automatically describing open-domain videos with natural language are attracting increasing interest in the field of artificial intelligence. Most existing methods simply borrow ideas from image captioning and obtain a compact video representation from an ensemble of global image feature before feeding to an RNN decoder which outputs a sentence of variable length. However, it is not only arduous for the generator to focus on specific salient objects at different time given the global video representation, it is more formidable to capture the fine-grained motion information and the relation between moving instances for more subtle linguistic descriptions. In this paper, we propose a Trajectory Structured Attentional Encoder-Decoder (TSA-ED) neural network framework for more elaborate video captioning which works by integrating local spatial-temporal representation at trajectory level through structured attention mechanism. Our proposed method is based on a LSTM-based encoder-decoder framework, which incorporates an attention modeling scheme to adaptively learn the correlation between sentence structure and the moving objects in videos, and consequently generates more accurate and meticulous statement description in the decoding stage. Experimental results demonstrate that the feature representation and structured attention mechanism based on the trajectory cluster can efficiently obtain the local motion information in the video to help generate a more fine-grained video description, and achieve the state-of-the-art performance on the well-known Charades and MSVD datasets.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"220 1","pages":"6829-6837"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79862382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple Granularity Group Interaction Prediction
Taiping Yao, Minsi Wang, Bingbing Ni, Huawei Wei, Xiaokang Yang
Most human activity analysis works (i.e., recognition or prediction) focus on only a single granularity: either modelling global motion based on coarse-level movement such as human trajectories, or forecasting future detailed actions based on body-part movement such as skeleton motion. In contrast, in this work we propose a multi-granularity interaction prediction network which integrates both global motion and detailed local action. Built on a bidirectional LSTM network, the proposed method possesses links between granularities which encourage feature sharing as well as cross-feature consistency between the global and local granularities (e.g., trajectory and local action), and in turn predicts the long-term global location and local dynamics of each individual. We validate our method on several public datasets with promising performance.
{"title":"Multiple Granularity Group Interaction Prediction","authors":"Taiping Yao, Minsi Wang, Bingbing Ni, Huawei Wei, Xiaokang Yang","doi":"10.1109/CVPR.2018.00239","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00239","url":null,"abstract":"Most human activity analysis works (i.e., recognition or prediction) only focus on a single granularity, i.e., either modelling global motion based on the coarse level movement such as human trajectories or forecasting future detailed action based on body parts' movement such as skeleton motion. In contrast, in this work, we propose a multi-granularity interaction prediction network which integrates both global motion and detailed local action. Built on a bidirectional LSTM network, the proposed method possesses between granularities links which encourage feature sharing as well as cross-feature consistency between both global and local granularity (e.g., trajectory or local action), and in turn predict long-term global location and local dynamics of each individual. We validate our method on several public datasets with promising performance.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"8 1","pages":"2246-2254"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82369546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crowd Counting via Adversarial Cross-Scale Consistency Pursuit
Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, Xiaokang Yang
Crowd counting, or density estimation, is a challenging task in computer vision due to large scale variations, perspective distortions, serious occlusions, etc. Existing methods generally suffer from two issues: 1) the model averaging effects in multi-scale CNNs induced by the widely adopted L2 regression loss; and 2) inconsistent estimation across differently scaled inputs. To explicitly address these issues, we propose a novel crowd counting (density estimation) framework called Adversarial Cross-Scale Consistency Pursuit (ACSCP). On one hand, a U-net structured generation network is designed to generate a density map from an input patch, and an adversarial loss is directly employed to shrink the solution onto a realistic subspace, thus attenuating the blurry effects of density map estimation. On the other hand, we design a novel scale-consistency regularizer which enforces that the sum of the crowd counts from local patches (i.e., small scale) is coherent with the overall count of their region union (i.e., large scale). These losses are integrated via a joint training scheme, so as to boost density estimation performance by further exploring the collaboration between the two objectives. Extensive experiments on four benchmarks demonstrate the effectiveness of the proposed innovations as well as superior performance over prior art.
{"title":"Crowd Counting via Adversarial Cross-Scale Consistency Pursuit","authors":"Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, Xiaokang Yang","doi":"10.1109/CVPR.2018.00550","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00550","url":null,"abstract":"Crowd counting or density estimation is a challenging task in computer vision due to large scale variations, perspective distortions and serious occlusions, etc. Existing methods generally suffer from two issues: 1) the model averaging effects in multi-scale CNNs induced by the widely adopted $$ regression loss; and 2) inconsistent estimation across different scaled inputs. To explicitly address these issues, we propose a novel crowd counting (density estimation) framework called Adversarial Cross-Scale Consistency Pursuit (ACSCP). On one hand, a U-net structured generation network is designed to generate density map from input patch, and an adversarial loss is directly employed to shrink the solution onto a realistic subspace, thus attenuating the blurry effects of density map estimation. On the other hand, we design a novel scale-consistency regularizer which enforces that the sum up of the crowd counts from local patches (i.e., small scale) is coherent with the overall count of their region union (i.e., large scale). The above losses are integrated via a joint training scheme, so as to help boost density estimation performance by further exploring the collaboration between both objectives. Extensive experiments on four benchmarks have well demonstrated the effectiveness of the proposed innovations as well as the superior performance over prior art.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"10 1","pages":"5245-5254"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82061516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation
Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, R. Urtasun, Jiaya Jia
In this paper, we propose the Geometric Neural Network (GeoNet) to jointly predict depth and surface normal maps from a single image. Building on top of two-stream CNNs, our GeoNet incorporates the geometric relation between depth and surface normals via new depth-to-normal and normal-to-depth networks. The depth-to-normal network exploits the least-squares solution for surface normals from depth and improves its quality with a residual module. The normal-to-depth network, conversely, refines the depth map based on the constraints from the surface normals through a kernel regression module, which has no parameters to learn. These two networks constrain the underlying model to efficiently predict depth and surface normals with high consistency and corresponding accuracy. Our experiments on the NYU v2 dataset verify that GeoNet is able to predict geometrically consistent depth and normal maps. It achieves top performance on surface normal estimation and is on par with state-of-the-art depth estimation methods.
{"title":"GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation","authors":"Xiaojuan Qi, Renjie Liao, Zhengzhe Liu, R. Urtasun, Jiaya Jia","doi":"10.1109/CVPR.2018.00037","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00037","url":null,"abstract":"In this paper, we propose Geometric Neural Network (GeoNet) to jointly predict depth and surface normal maps from a single image. Building on top of two-stream CNNs, our GeoNet incorporates geometric relation between depth and surface normal via the new depth-to-normal and normal-to-depth networks. Depth-to-normal network exploits the least square solution of surface normal from depth and improves its quality with a residual module. Normal-to-depth network, contrarily, refines the depth map based on the constraints from the surface normal through a kernel regression module, which has no parameter to learn. These two networks enforce the underlying model to efficiently predict depth and surface normal for high consistency and corresponding accuracy. Our experiments on NYU v2 dataset verify that our GeoNet is able to predict geometrically consistent depth and normal maps. It achieves top performance on surface normal estimation and is on par with state-of-the-art depth estimation methods.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"42 1","pages":"283-291"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82334131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation
Adrian V. Dalca, J. Guttag, M. Sabuncu
We consider the problem of segmenting a biomedical image into anatomical regions of interest. We specifically address the frequent scenario where we have no paired training data containing images and their manual segmentations. Instead, we employ unpaired segmentation images to build an anatomical prior. Critically, these segmentations can be derived from imaging data from a different dataset and imaging modality than the current task. We introduce a generative probabilistic model that employs the learned prior through a convolutional neural network to compute segmentations in an unsupervised setting. We conducted an empirical analysis of the proposed approach in the context of structural brain MRI segmentation, using a multi-study dataset of more than 14,000 scans. Our results show that an anatomical prior enables fast unsupervised segmentation, which is typically not possible using standard convolutional networks. The integration of anatomical priors can facilitate CNN-based anatomical segmentation in a range of novel clinical problems where few or no annotations are available and standard networks are thus not trainable. The code, model definitions and model weights are freely available at http://github.com/adalca/neuron.
{"title":"Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation","authors":"Adrian V. Dalca, J. Guttag, M. Sabuncu","doi":"10.1109/CVPR.2018.00968","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00968","url":null,"abstract":"We consider the problem of segmenting a biomedical image into anatomical regions of interest. We specifically address the frequent scenario where we have no paired training data that contains images and their manual segmentations. Instead, we employ unpaired segmentation images that we use to build an anatomical prior. Critically these segmentations can be derived from imaging data from a different dataset and imaging modality than the current task. We introduce a generative probabilistic model that employs the learned prior through a convolutional neural network to compute segmentations in an unsupervised setting. We conducted an empirical analysis of the proposed approach in the context of structural brain MRI segmentation, using a multi-study dataset of more than 14,000 scans. Our results show that an anatomical prior enables fast unsupervised segmentation which is typically not possible using standard convolutional networks. The integration of anatomical priors can facilitate CNN-based anatomical segmentation in a range of novel clinical problems, where few or no annotations are available and thus standard networks are not trainable. The code, model definitions and model weights are freely available at http://github.com/adalca/neuron.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"16 1","pages":"9290-9299"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76386875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wrapped Gaussian Process Regression on Riemannian Manifolds
Anton Mallasto, Aasa Feragen
Gaussian process (GP) regression is a powerful tool in non-parametric regression providing uncertainty estimates. However, it is limited to data in vector spaces. In fields such as shape analysis and diffusion tensor imaging, the data often lies on a manifold, making GP regression nonviable, as the resulting predictive distribution does not live in the correct geometric space. We tackle the problem by defining wrapped Gaussian processes (WGPs) on Riemannian manifolds, using the probabilistic setting to generalize GP regression to the context of manifold-valued targets. The method is validated empirically on diffusion weighted imaging (DWI) data, directional data on the sphere, and data in the Kendall shape space, endorsing WGP regression as an efficient and flexible tool for manifold-valued regression.
{"title":"Wrapped Gaussian Process Regression on Riemannian Manifolds","authors":"Anton Mallasto, Aasa Feragen","doi":"10.1109/CVPR.2018.00585","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00585","url":null,"abstract":"Gaussian process (GP) regression is a powerful tool in non-parametric regression providing uncertainty estimates. However, it is limited to data in vector spaces. In fields such as shape analysis and diffusion tensor imaging, the data often lies on a manifold, making GP regression nonviable, as the resulting predictive distribution does not live in the correct geometric space. We tackle the problem by defining wrapped Gaussian processes (WGPs) on Riemannian manifolds, using the probabilistic setting to generalize GP regression to the context of manifold-valued targets. The method is validated empirically on diffusion weighted imaging (DWI) data, directional data on the sphere and in the Kendall shape space, endorsing WGP regression as an efficient and flexible tool for manifold-valued regression.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"359 1","pages":"5580-5588"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76417580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Constrained Deep Neural Network for Ordinal Regression
Yanzhu Liu, A. Kong, C. Goh
Ordinal regression is a supervised learning problem aiming to classify instances into ordinal categories. It is challenging to automatically extract high-level features that represent intraclass information and the interclass ordinal relationship simultaneously. This paper proposes a constrained optimization formulation for the ordinal regression problem which minimizes the negative log-likelihood for multiple categories, constrained by the order relationship between instances. Mathematically, it is equivalent to an unconstrained formulation with a pairwise regularizer. An implementation based on the CNN framework is proposed to solve the problem, such that high-level features can be extracted automatically and the optimal solution can be learned through traditional back-propagation. The proposed pairwise constraints make the algorithm work even on small datasets, and a proposed efficient implementation makes it scalable to large datasets. Experimental results on four real-world benchmarks demonstrate that the proposed algorithm outperforms traditional deep learning approaches and other state-of-the-art approaches based on hand-crafted features.
{"title":"A Constrained Deep Neural Network for Ordinal Regression","authors":"Yanzhu Liu, A. Kong, C. Goh","doi":"10.1109/CVPR.2018.00093","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00093","url":null,"abstract":"Ordinal regression is a supervised learning problem aiming to classify instances into ordinal categories. It is challenging to automatically extract high-level features for representing intraclass information and interclass ordinal relationship simultaneously. This paper proposes a constrained optimization formulation for the ordinal regression problem which minimizes the negative loglikelihood for multiple categories constrained by the order relationship between instances. Mathematically, it is equivalent to an unconstrained formulation with a pairwise regularizer. An implementation based on the CNN framework is proposed to solve the problem such that high-level features can be extracted automatically, and the optimal solution can be learned through the traditional back-propagation method. The proposed pairwise constraints make the algorithm work even on small datasets, and a proposed efficient implementation make it be scalable for large datasets. Experimental results on four real-world benchmarks demonstrate that the proposed algorithm outperforms the traditional deep learning approaches and other state-of-the-art approaches based on hand-crafted features.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"18 1","pages":"831-839"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76062685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation
Pia Bideau, Aruni RoyChowdhury, Rakesh R Menon, E. Learned-Miller
Traditional methods of motion segmentation use powerful geometric constraints to understand motion, but fail to leverage the semantics of high-level image understanding. Modern CNN methods of motion analysis, on the other hand, excel at identifying well-known structures, but may not precisely characterize well-known geometric constraints. In this work, we build a new statistical model of rigid motion flow based on classical perspective projection constraints. We then combine piecewise rigid motions into complex deformable and articulated objects, guided by semantic segmentation from CNNs and a second "object-level" statistical model. This combination of classical geometric knowledge with the pattern recognition abilities of CNNs yields excellent performance on a wide range of motion segmentation benchmarks, from complex geometric scenes to camouflaged animals.
{"title":"The Best of Both Worlds: Combining CNNs and Geometric Constraints for Hierarchical Motion Segmentation","authors":"Pia Bideau, Aruni RoyChowdhury, Rakesh R Menon, E. Learned-Miller","doi":"10.1109/CVPR.2018.00060","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00060","url":null,"abstract":"Traditional methods of motion segmentation use powerful geometric constraints to understand motion, but fail to leverage the semantics of high-level image understanding. Modern CNN methods of motion analysis, on the other hand, excel at identifying well-known structures, but may not precisely characterize well-known geometric constraints. In this work, we build a new statistical model of rigid motion flow based on classical perspective projection constraints. We then combine piecewise rigid motions into complex deformable and articulated objects, guided by semantic segmentation from CNNs and a second \"object-level\" statistical model. This combination of classical geometric knowledge combined with the pattern recognition abilities of CNNs yields excellent performance on a wide range of motion segmentation benchmarks, from complex geometric scenes to camouflaged animals.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"1 1","pages":"508-517"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76095395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly Supervised Facial Action Unit Recognition Through Adversarial Training
Guozhu Peng, Shangfei Wang
Current works on facial action unit (AU) recognition typically require fully AU-annotated facial images for supervised AU classifier training. AU annotation is a time-consuming, expensive, and error-prone process. While AUs are hard to annotate, facial expression is relatively easy to label. Furthermore, there exist strong probabilistic dependencies between expressions and AUs, as well as dependencies among AUs. Such dependencies are referred to as domain knowledge. In this paper, we propose a novel AU recognition method that learns AU classifiers from domain knowledge and expression-annotated facial images through adversarial training. Specifically, we first generate pseudo AU labels according to the probabilistic dependencies between expressions and AUs, as well as the correlations among AUs summarized from domain knowledge. Then we propose a weakly supervised AU recognition method via an adversarial process, in which we simultaneously train two models: a recognition model R, which learns AU classifiers, and a discrimination model D, which estimates the probability that a set of AU labels was generated from domain knowledge rather than recognized by R. The training procedure for R maximizes the probability of D making a mistake. By leveraging this adversarial mechanism, the distribution of recognized AUs is kept close to the AU prior distribution derived from domain knowledge. Furthermore, the proposed weakly supervised AU recognition can be extended to semi-supervised learning scenarios with partially AU-annotated images. Experimental results on three benchmark databases demonstrate that the proposed method successfully leverages the summarized domain knowledge for weakly supervised AU classifier learning through an adversarial process, and thus achieves state-of-the-art performance.
{"title":"Weakly Supervised Facial Action Unit Recognition Through Adversarial Training","authors":"Guozhu Peng, Shangfei Wang","doi":"10.1109/CVPR.2018.00233","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00233","url":null,"abstract":"Current works on facial action unit (AU) recognition typically require fully AU-annotated facial images for supervised AU classifier training. AU annotation is a time-consuming, expensive, and error-prone process. While AUs are hard to annotate, facial expression is relatively easy to label. Furthermore, there exist strong probabilistic dependencies between expressions and AUs as well as dependencies among AUs. Such dependencies are referred to as domain knowledge. In this paper, we propose a novel AU recognition method that learns AU classifiers from domain knowledge and expression-annotated facial images through adversarial training. Specifically, we first generate pseudo AU labels according to the probabilistic dependencies between expressions and AUs as well as correlations among AUs summarized from domain knowledge. Then we propose a weakly supervised AU recognition method via an adversarial process, in which we simultaneously train two models: a recognition model R, which learns AU classifiers, and a discrimination model D, which estimates the probability that AU labels generated from domain knowledge rather than the recognized AU labels from R. The training procedure for R maximizes the probability of D making a mistake. By leveraging the adversarial mechanism, the distribution of recognized AUs is closed to AU prior distribution from domain knowledge. Furthermore, the proposed weakly supervised AU recognition can be extended to semi-supervised learning scenarios with partially AU-annotated images. Experimental results on three benchmark databases demonstrate that the proposed method successfully leverages the summarized domain knowledge to weakly supervised AU classifier learning through an adversarial process, and thus achieves state-of-the-art performance.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"71 1","pages":"2188-2196"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87995762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}