Ancient Painting to Natural Image: A New Solution for Painting Processing (WACV 2019, doi: 10.1109/WACV.2019.00061)
Tingting Qiao, Weijing Zhang, Miao Zhang, Zixuan Ma, Duanqing Xu
Collecting a large-scale, well-annotated dataset for image processing has become common practice in computer vision. In the ancient painting domain, however, this is impractical: the number of paintings is limited and their styles are highly diverse. We therefore propose a novel solution to the problems that come with ancient painting processing: use domain transfer to convert ancient paintings into photo-realistic natural images. In this way, "ancient painting processing problems" become "natural image processing problems", and models trained on natural images can be applied directly to the transferred paintings. Specifically, we focus on Chinese ancient flower, bird, and landscape paintings. A novel Domain Style Transfer Network (DSTN) is proposed to transfer ancient paintings to natural images; it employs a compound loss to ensure that the transferred paintings maintain the color composition and content of the input paintings. The experimental results show that the transferred paintings generated by the DSTN perform better than those of other state-of-the-art methods in both a human perceptual test and downstream image processing tasks, indicating the authenticity of the transferred paintings and the superiority of the proposed method.
TextCaps: Handwritten Character Recognition With Very Small Datasets (WACV 2019, doi: 10.1109/WACV.2019.00033)
Vinoj Jayasundara, S. Jayasekara, Hirunima Jayasekara, Jathushan Rajasegaran, Suranga Seneviratne, R. Rodrigo
Many localized languages struggle to reap the benefits of recent advancements in character recognition systems due to the lack of a substantial amount of labeled training data. This stems from the difficulty of generating large amounts of labeled data for such languages and the inability of deep learning techniques to learn properly from a small number of training samples. We solve this problem by introducing a technique for generating new training samples from existing ones, with realistic augmentations that reflect the actual variations present in human handwriting, by adding random controlled noise to their corresponding instantiation parameters. With a mere 200 training samples per class, our results surpass existing character recognition results on the EMNIST-letters dataset while matching existing results on the three datasets EMNIST-balanced, EMNIST-digits, and MNIST. We also develop a strategy for effectively combining loss functions to improve reconstructions. Our system is useful for character recognition in localized languages that lack much labeled training data, and even in related, more general contexts such as object recognition.
High-Speed Video from Asynchronous Camera Array (WACV 2019, doi: 10.1109/WACV.2019.00237)
Si Lu
This paper presents a method for capturing high-speed video using an asynchronous camera array. Our method sequentially fires each sensor in the array with a small time offset and assembles the captured frames into a high-speed video according to their time stamps. The resulting video, however, suffers from parallax jitter caused by the viewpoint differences among the sensors. To address this problem, we develop a dedicated novel view synthesis algorithm that transforms the video frames as if they had been captured by a single reference sensor. For any frame from a non-reference sensor, we find the two temporally neighboring frames captured by the reference sensor. Using these three frames, we render a new frame with the same time stamp as the non-reference frame but from the viewpoint of the reference sensor: we segment the frames into superpixels, apply local content-preserving warping to warp them into the new frame, and employ a multi-label Markov Random Field method to blend the warped frames. Our experiments show that our method produces high-quality, high-speed video for a wide variety of scenes with large parallax, scene dynamics, and camera motion, and outperforms several baseline and state-of-the-art approaches.
Attentive Conditional Channel-Recurrent Autoencoding for Attribute-Conditioned Face Synthesis (WACV 2019, doi: 10.1109/WACV.2019.00168)
Wenling Shang, Kihyuk Sohn
Attribute-conditioned face synthesis has many potential use cases, such as aiding the identification of a suspect or a missing person. Building on a conditional version of VAE-GAN, we augment the pathways connecting the latent space with a channel-recurrent architecture in order to provide not only improved generation quality but also interpretable high-level features. In particular, to better achieve the latter, we further propose an attention mechanism over each attribute to indicate the specific latent subset responsible for its modulation. Thanks to the latent semantics formed via the channel recurrence, we envision a tool that takes the desired attributes as inputs and then performs a two-stage, general-to-specific generation of diverse and realistic faces. Lastly, we incorporate a progressive-growth training scheme into the inference, generation, and discriminator networks of our model to facilitate higher-resolution outputs. Evaluations are performed through both qualitative visual examination and quantitative metrics, namely inception scores, human preferences, and attribute classification accuracy.
VelocityGAN: Subsurface Velocity Image Estimation Using Conditional Adversarial Networks (WACV 2019, doi: 10.1109/WACV.2019.00080)
Zhongping Zhang, Yue Wu, Zheng Zhou, Youzuo Lin
Acoustic- and elastic-waveform inversion is an important and widely used method for reconstructing subsurface velocity images. Waveform inversion is a typical non-linear, ill-posed inverse problem. Existing physics-driven computational methods for waveform inversion suffer from cycle skipping and local minima, and solving the inversion is computationally expensive. In this paper, we develop a real-time, data-driven technique, VelocityGAN, to accurately reconstruct subsurface velocities. VelocityGAN is an end-to-end framework that generates high-quality velocity images directly from raw seismic waveform data. A series of experiments on synthetic seismic reflection data evaluates the effectiveness and efficiency of VelocityGAN. We compare it not only with existing physics-driven approaches but also with several deep learning frameworks as data-driven baselines. The results show that VelocityGAN outperforms the physics-driven waveform inversion methods and achieves state-of-the-art performance among the data-driven baselines.
Taylor Convolutional Networks for Image Classification (WACV 2019, doi: 10.1109/WACV.2019.00140)
Xiaodi Wang, Ce Li, Yipeng Mou, Baochang Zhang, J. Han, Jianzhuang Liu
This paper provides a new perspective for understanding CNNs based on the Taylor expansion, leading to new Taylor Convolutional Networks (TaylorNets) for image classification. We introduce a principled combination of high-frequency (i.e., detailed) information and low-frequency information in end-to-end TaylorNets, based on a nonlinear combination of the convolutional feature maps. The steerable module developed in TaylorNets is generic: it can be easily integrated into well-known deep architectures and learned within the same back-propagation pipeline, yielding a higher representation capacity for CNNs. Extensive experimental results demonstrate the capability of TaylorNets, which improve widely used CNN architectures, such as conventional CNNs and ResNet, in terms of object classification accuracy on well-known benchmarks. The code will be made publicly available.
On Measuring the Iconicity of a Face (WACV 2019, doi: 10.1109/WACV.2019.00231)
Prithviraj Dhar, C. Castillo, R. Chellappa
For a given identity in a face dataset, certain iconic images are more representative of the subject than others. In this paper, we explore the problem of computing the iconicity of a face. The premise of the proposed approach is as follows: for an identity containing a mixture of iconic and non-iconic images, if a given face cannot be successfully matched with any other face of the same identity, then the iconicity of that face image is low. Using this information, we train a Siamese multi-layer perceptron network such that each of its twins predicts the iconicity score of one image in the input feature pair. We observe how the obtained scores vary with covariates such as blur, yaw, pitch, roll, and occlusion, demonstrating that they effectively predict image quality, and compare them with other existing metrics. Furthermore, we use these scores to weight features for template-based face verification and compare this with media averaging of features.
Improving 3D Human Pose Estimation Via 3D Part Affinity Fields (WACV 2019, doi: 10.1109/WACV.2019.00112)
Ding Liu, Zixu Zhao, Xinchao Wang, Yuxiao Hu, Lei Zhang, Thomas Huang
3D human pose estimation from monocular images has recently become an active area of computer vision. For years, most deep-neural-network-based approaches have been either end-to-end or two-stage. An end-to-end network typically estimates 3D human poses directly from 2D input images, but it suffers from the shortage of 3D human pose data, and it is hard to tell whether inaccuracies stem from limited visual understanding or from the 2D-to-3D mapping. A two-stage approach, in contrast, uses an existing network for 2D keypoint detection and then lifts the 2D keypoints directly to 3D space, but it tends to ignore useful contextual hints in the raw 2D image pixels. In this paper, we introduce a two-stage architecture that eliminates the main disadvantages of both approaches. In the first stage, we use an existing state-of-the-art detector to estimate 2D poses. To add contextual information that helps lift 2D poses to 3D, we propose 3D Part Affinity Fields (3D-PAFs). We use 3D-PAFs to infer 3D limb vectors and combine them with the 2D poses to regress the 3D coordinates. We trained and tested the proposed framework on Human3.6M, the most popular 3D human pose benchmark dataset. Our approach achieves state-of-the-art performance, showing that with the right choice of contextual information, a simple regression model can be very powerful for estimating 3D poses.
Euclidean Invariant Recognition of 2D Shapes Using Histograms of Magnitudes of Local Fourier-Mellin Descriptors (WACV 2019, doi: 10.1109/WACV.2019.00038)
Xinhua Zhang, L. Williams
Because the magnitudes of inner products with its basis functions are invariant to rotation and scale change, the Fourier-Mellin transform has long been used as a component in Euclidean invariant 2D shape recognition systems. Yet Fourier-Mellin transform magnitudes are only invariant to rotation and scale changes about a known center point, and fully Euclidean invariant shape recognition is not possible unless this center point can be identified consistently and accurately. In this paper, we describe a system in which a Fourier-Mellin transform is computed at every point in the image. The spatial support of the Fourier-Mellin basis functions is made local by multiplying them with a polynomial envelope. Significantly, the magnitudes of convolutions with these complex filters at isolated points are not (by themselves) used as features for Euclidean invariant shape recognition, because reliable discrimination would require filters with spatial support large enough to fully encompass the shapes. Instead, we rely on the fact that normalized histograms of magnitudes are fully Euclidean invariant. We demonstrate a system based on the VLAD machine learning method that performs Euclidean invariant recognition of 2D shapes and requires an order of magnitude less training data than comparable methods based on convolutional neural networks.
On the Importance of Feature Aggregation for Face Reconstruction (WACV 2019, doi: 10.1109/WACV.2019.00103)
Xiang Xu, Ha A. Le, I. Kakadiaris
The goal of this work is to seek principles for designing a deep neural network for 3D face reconstruction from a single image. To keep the evaluation simple, we generated a synthetic dataset and used it for evaluation. We conducted extensive experiments with an end-to-end face reconstruction algorithm (E2FAR) and its variations, and analyzed why it can be successfully applied to 3D face reconstruction. From these comparative studies, we conclude that aggregating features from different layers is key to training better neural networks for 3D face reconstruction. Based on these observations, we propose a face reconstruction feature aggregation network (FR-FAN), which obtains significant improvements over the baselines on the synthetic validation set. We also evaluate our model on popular existing indoor and in-the-wild 2D-3D datasets. Extensive experiments demonstrate that FR-FAN performs 16.50% and 9.54% better than E2FAR on BU-3DFE and JNU-3D, respectively. Finally, a sensitivity analysis on controlled datasets demonstrates that our network is robust to large variations in pose, illumination, and expression.