Zheng Zhu, Guan Huang, Wei Zou, Dalong Du, Chang Huang
Convolutional neural network (CNN) based tracking approaches have shown favorable performance on recent benchmarks. Nonetheless, the chosen CNN features are typically pre-trained on a different task, and the individual components of the tracking system are learned separately, so the achieved tracking performance may be suboptimal. Moreover, most of these trackers are not designed for real-time applications because of their time-consuming feature extraction and complex optimization. In this paper, we propose an end-to-end framework that learns the convolutional features and performs the tracking process simultaneously, namely a unified convolutional tracker (UCT). Specifically, UCT treats both the feature extractor and the tracking process (ridge regression) as convolution operations and trains them jointly, so that the learned CNN features are tightly coupled to the tracking process. For online tracking, an efficient model update scheme is proposed based on a peak-versus-noise ratio (PNR) criterion, and scale changes are handled efficiently by incorporating a scale branch into the network. The proposed approach achieves superior tracking performance while maintaining real-time speed: the standard UCT and UCT-Lite track generic objects at 41 FPS and 154 FPS, respectively, without further optimization. Experiments are performed on four challenging benchmark tracking datasets, OTB2013, OTB2015, VOT2014 and VOT2015, and our method achieves state-of-the-art results among real-time trackers on these benchmarks.
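To make the "ridge regression as a convolution" idea concrete, a standard correlation-filter formulation (a common choice in this line of work; the exact objective used by UCT may differ) solves

\[
\min_{\mathbf{w}}\ \lVert \mathbf{X}\mathbf{w}-\mathbf{y}\rVert_2^2+\lambda\lVert\mathbf{w}\rVert_2^2
\quad\Longrightarrow\quad
\mathbf{w}^{*}=(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{y},
\]

where the rows of \(\mathbf{X}\) are feature patches and \(\mathbf{y}\) is a Gaussian-shaped regression target. When \(\mathbf{X}\) collects all circular shifts of a single patch, applying \(\mathbf{w}^{*}\) to a search region is exactly a correlation (convolution) with the learned filter, which is why the regressor can be implemented as a convolutional layer and trained jointly with the feature extractor.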
{"title":"UCT: Learning Unified Convolutional Networks for Real-Time Visual Tracking","authors":"Zheng Zhu, Guan Huang, Wei Zou, Dalong Du, Chang Huang","doi":"10.1109/ICCVW.2017.231","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.231","url":null,"abstract":"Convolutional neural networks (CNN) based tracking approaches have shown favorable performance in recent benchmarks. Nonetheless, the chosen CNN features are always pre-trained in different task and individual components in tracking systems are learned separately, thus the achieved tracking performance may be suboptimal. Besides, most of these trackers are not designed towards realtime applications because of their time-consuming feature extraction and complex optimization details. In this paper, we propose an end-to-end framework to learn the convolutional features and perform the tracking process simultaneously, namely, a unified convolutional tracker (UCT). Specifically, The UCT treats feature extractor and tracking process (ridge regression) both as convolution operation and trains them jointly, enabling learned CNN features are tightly coupled to tracking process. In online tracking, an efficient updating method is proposed by introducing peak-versus-noise ratio (PNR) criterion, and scale changes are handled efficiently by incorporating a scale branch into network. The proposed approach results in superior tracking performance, while maintaining real-time speed. The standard UCT and UCT-Lite can track generic objects at 41 FPS and 154 FPS without further optimization, respectively. Experiments are performed on four challenging benchmark tracking datasets: OTB2013, OTB2015, VOT2014 and VOT2015, and our method achieves state-of-the-art results on these benchmarks compared with other real-time trackers.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132276624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As both a particularly interesting application in the realm of cultural heritage and a technically challenging problem, computer vision based analysis of Roman Imperial coins has been attracting an increasing amount of research. In this paper we make several important contributions. Firstly, we address a key limitation of existing work, which is largely characterized by the application of generic object recognition techniques and the lack of use of domain knowledge. In contrast, our work approaches coin recognition in much the same way as a human expert would: by identifying the emperor universally shown on the obverse. To this end we develop a deep convolutional network carefully crafted for what is effectively a specific instance of profile face recognition. No less importantly, we also address a major methodological flaw of previous research which is, as we explain in detail, insufficiently systematic and rigorous, and mired in confounding factors. Lastly, we introduce three carefully collected and annotated data sets, and use them to demonstrate the effectiveness of the proposed approach, which is shown to exceed the performance of the state of the art by approximately an order of magnitude.
{"title":"Ancient Roman Coin Recognition in the Wild Using Deep Learning Based Recognition of Artistically Depicted Face Profiles","authors":"Imanol Schlag, Ognjen Arandjelovic","doi":"10.1109/ICCVW.2017.342","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.342","url":null,"abstract":"As a particularly interesting application in the realm of cultural heritage on the one hand, and a technically challenging problem, computer vision based analysis of Roman Imperial coins has been attracting an increasing amount of research. In this paper we make several important contributions. Firstly, we address a key limitation of existing work which is largely characterized by the application of generic object recognition techniques and the lack of use of domain knowledge. In contrast, our work approaches coin recognition in much the same way as a human expert would: by identifying the emperor universally shown on the obverse. To this end we develop a deep convolutional network, carefully crafted for what is effectively a specific instance of profile face recognition. No less importantly, we also address a major methodological flaw of previous research which is, as we explain in detail, insufficiently systematic and rigorous, and mired with confounding factors. Lastly, we introduce three carefully collected and annotated data sets, and using these demonstrate the effectiveness of the proposed approach which is shown to exceed the performance of the state of the art by approximately an order of magnitude.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"248 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128136182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatiotemporal deep neural networks using weak, border-level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution over likely regions and training a discriminative 3D-CNN on this data. The classifier is then used to calculate the posterior distribution by scoring the training examples, and this posterior serves as the prior for the next sampling stage. We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition and evaluate performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results, obtaining Mean Jaccard Index scores of 0.3646 and 0.3744 on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and were ranked 3rd. It should be noted that our method is learner-independent; it can easily be combined with other approaches.
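The sample-train-rescore loop described above can be summarized in a few lines. The sketch below is a toy illustration under stated assumptions: a stub overlap score stands in for the discriminative 3D-CNN, the stream and segment values are made up, and none of the names come from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200                                   # frames in the continuous stream (toy)
true_center, true_width = 120, 30         # hidden ground-truth gesture segment (toy)

def score_segment(center, width):
    """Stub for the discriminative 3D-CNN: higher when the sampled
    training volume overlaps the true gesture segment."""
    overlap = max(0.0, min(center + width / 2, true_center + true_width / 2)
                       - max(center - width / 2, true_center - true_width / 2))
    return overlap / true_width

# Prior over segment centers: start uniform over the stream.
particles = rng.uniform(0, T, size=500)
weights = np.ones_like(particles) / len(particles)

for _ in range(5):
    # 1) Draw training volumes from the current prior (resample + jitter).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx] + rng.normal(0, 5, size=len(particles))
    # 2) "Train"/apply the classifier and score each sampled volume.
    scores = np.array([score_segment(c, true_width) for c in particles])
    # 3) The scored samples define the posterior, which becomes the next prior.
    weights = scores + 1e-6
    weights /= weights.sum()

print("estimated gesture center:", float(np.sum(weights * particles)))
```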
{"title":"Particle Filter Based Probabilistic Forced Alignment for Continuous Gesture Recognition","authors":"Necati Cihan Camgöz, Simon Hadfield, R. Bowden","doi":"10.1109/ICCVW.2017.364","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.364","url":null,"abstract":"In this paper, we propose a novel particle filter based probabilistic forced alignment approach for training spatiotemporal deep neural networks using weak border level annotations. The proposed method jointly learns to localize and recognize isolated instances in continuous streams. This is done by drawing training volumes from a prior distribution of likely regions and training a discriminative 3D-CNN from this data. The classifier is then used to calculate the posterior distribution by scoring the training examples and using this as the prior for the next sampling stage. We apply the proposed approach to the challenging task of large-scale user-independent continuous gesture recognition. We evaluate the performance on the popular ChaLearn 2016 Continuous Gesture Recognition (ConGD) dataset. Our method surpasses state-of-the-art results by obtaining 0.3646 and 0.3744 Mean Jaccard Index Score on the validation and test sets of ConGD, respectively. Furthermore, we participated in the ChaLearn 2017 Continuous Gesture Recognition Challenge and was ranked 3rd. It should be noted that our method is learner independent, it can be easily combined with other approaches.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125991351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
If a CAD model of a rigid object is available, the location of any point on the object can be derived from its measured 6DOF pose. However, the uncertainty of the measured pose propagates to the uncertainty of the point in an anisotropic way. We investigate this propagation for the class of systems that determine an object's pose using point-based rigid body registration. For such systems, the uncertainty in the locations of the points used for registration propagates to the pose uncertainty. We find that the direction corresponding to the smallest propagated uncertainty remains relatively unchanged in the object's local frame, regardless of object pose. We show that this direction may be closely approximated by the moment-of-inertia axis derived from the configuration of the fiducials. We use the existing theory of rigid-body registration to explain the experimental results, and discuss the limitations of the theory and the practical implications of our findings.
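A simple way to observe the anisotropic propagation described above is a Monte-Carlo simulation: perturb the fiducials, re-estimate the pose with least-squares (Kabsch) registration, and inspect the covariance of a propagated point. The sketch below is illustrative only; the fiducial layout, noise level and target point are assumptions, and it replaces the paper's analytical treatment with sampling.

```python
import numpy as np

rng = np.random.default_rng(1)
fiducials = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 0.5]], float)
target = np.array([3.0, 1.0, 0.2])         # point whose location is derived from the pose
sigma = 0.01                               # isotropic noise on the measured fiducials

def kabsch(P, Q):
    """Least-squares rotation R and translation t mapping P onto Q."""
    cP, cQ = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cQ - R @ cP

samples = []
for _ in range(2000):
    measured = fiducials + rng.normal(0, sigma, fiducials.shape)
    R, t = kabsch(fiducials, measured)     # pose estimated from noisy fiducials
    samples.append(R @ target + t)         # propagate the pose to the target point

cov = np.cov(np.array(samples).T)
evals, evecs = np.linalg.eigh(cov)
print("std along principal directions:", np.sqrt(evals))
print("direction of smallest uncertainty:", evecs[:, 0])
```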
{"title":"Propagation of Orientation Uncertainty of 3D Rigid Object to Its Points","authors":"M. Franaszek, G. Cheok","doi":"10.1109/ICCVW.2017.255","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.255","url":null,"abstract":"If a CAD model of a rigid object is available, the location of any point on an object can be derived from the measured 6DOF pose of the object. However, the uncertainty of the measured pose propagates to the uncertainty of the point in an anisotropic way. We investigate this propagation for a class of systems that determine an object pose by using point-based rigid body registration. For such systems, the uncertainty in the location of the points used for registration propagates to the pose uncertainty. We find that for different poses of the object, the direction corresponding to the smallest propagated uncertainty remains relatively unchanged in the object's local frame, regardless of object pose. We show that this direction may be closely approximated by the moment of inertia axis which is based on the configuration of the fiducials. We use existing theory of rigid-body registration to explain the experimental results, discuss the limitations of the theory and practical implications of our findings.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131873501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonas Vlasselaer, C. Crispim, F. Brémond, Anton Dries
This paper proposes BEHAVE, a person-centered pipeline for probabilistic event recognition. The pipeline first detects the set of people in a video frame, then searches for correspondences between people in the current and previous frames (i.e., people tracking). Finally, event recognition is carried out for each person using probabilistic logic models (PLMs, ProbLog2 language). PLMs represent interactions among people, home appliances and semantic regions; they also make it possible to assess the probability of an event given noisy observations of the real world. BEHAVE was evaluated on the task of online (non-clipped videos) and open-set event recognition (i.e., target events plus a "none" class) on video recordings of seniors carrying out daily tasks. Results show that BEHAVE improves event recognition accuracy by handling missed and partially satisfied logic models. Future work will investigate how to extend PLMs to represent temporal relations among events.
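As a rough illustration of how a probabilistic logic model turns noisy observations into an event probability, the sketch below evaluates a hypothetical "prepare_hot_drink" rule in plain Python. The rule, the observation probabilities and the independence assumptions are invented for illustration and do not reproduce ProbLog2 semantics or the models used in BEHAVE.

```python
def prob_and(*ps):      # conjunction of independent conditions
    out = 1.0
    for p in ps:
        out *= p
    return out

def noisy_or(*ps):      # at least one of several independent cues holds
    out = 1.0
    for p in ps:
        out *= (1.0 - p)
    return 1.0 - out

# Hypothetical per-frame observation probabilities from detectors/trackers:
p_in_kitchen  = 0.92    # tracked person localized inside the "kitchen" semantic region
p_near_kettle = 0.75    # person close to the kettle appliance
p_kettle_on   = 0.60    # visual/sensor cue that the kettle is in use

# prepare_hot_drink :- in_kitchen, (near_kettle ; kettle_on).
p_event = prob_and(p_in_kitchen, noisy_or(p_near_kettle, p_kettle_on))
print(f"P(prepare_hot_drink) = {p_event:.3f}")
```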
{"title":"BEHAVE — Behavioral Analysis of Visual Events for Assisted Living Scenarios","authors":"Jonas Vlasselaer, C. Crispim, F. Brémond, Anton Dries","doi":"10.1109/ICCVW.2017.160","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.160","url":null,"abstract":"This paper proposes BEHAVE, a person-centered pipeline for probabilistic event recognition. The proposed pipeline firstly detects the set of people in a video frame, then it searches for correspondences between people in the current and previous frames (i.e., people tracking). Finally, event recognition is carried for each person using probabilistic logic models (PLMs, ProbLog2 language). PLMs represent interactions among people, home appliances and semantic regions. They also enable one to assess the probability of an event given noisy observations of the real world. BEHAVE was evaluated on the task of online (non-clipped videos) and open-set event recognition (e.g., target events plus none class) on video recordings of seniors carrying out daily tasks. Results have shown that BEHAVE improves event recognition accuracy by handling missed and partially satisfied logic models. Future work will investigate how to extend PLMs to represent temporal relations among events.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122806533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local and global approaches can be identified as the two main classes of optical flow estimation methods. In this paper, we propose a framework that combines the advantages of these two principles, namely the robustness to noise of the local approach and the discontinuity preservation of the global approach. This is particularly crucial in biological imaging, where the noise produced by microscopes is one of the main issues for optical flow estimation. The idea is to spatially adapt the local support of the local parametric constraint in the combined local-global model [6]. To this end, we jointly estimate the motion field and the parameters of the spatial support. We apply our approach to the case of Gaussian filtering, and we derive efficient minimization schemes for common data terms. Estimating a spatially varying standard-deviation map prevents the smoothing of motion discontinuities while ensuring robustness to noise. We validate our method for a standard model and demonstrate how a baseline approach with a pixel-wise data term can be improved when integrated in our framework. The method is evaluated on the Middlebury benchmark with ground truth and on real fluorescence microscopy data.
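Schematically, a combined local-global energy with spatially adaptive support can be written as (a generic form for intuition; the exact energy and data term of the paper may differ)

\[
E(\mathbf{w},\sigma)=\int_{\Omega}\big(K_{\sigma(\mathbf{x})}\ast\rho_{\mathrm{data}}(\mathbf{w})\big)(\mathbf{x})\,d\mathbf{x}
\;+\;\alpha\int_{\Omega}\lVert\nabla\mathbf{w}(\mathbf{x})\rVert^{2}\,d\mathbf{x},
\]

where \(K_{\sigma(\mathbf{x})}\) is a Gaussian of locally estimated standard deviation \(\sigma(\mathbf{x})\), \(\rho_{\mathrm{data}}\) is a pixel-wise data term (e.g., linearized brightness constancy), and the minimization is carried out jointly over the flow \(\mathbf{w}\) and the map \(\sigma\): a large \(\sigma\) averages out noise in homogeneous regions, while a small \(\sigma\) near motion boundaries avoids smoothing discontinuities.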
{"title":"Spatially-Variant Kernel for Optical Flow Under Low Signal-to-Noise Ratios Application to Microscopy","authors":"Denis Fortun, N. Debroux, C. Kervrann","doi":"10.1109/ICCVW.2017.12","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.12","url":null,"abstract":"Local and global approaches can be identified as the two main classes of optical flow estimation methods. In this paper, we propose a framework to combine the advantages of these two principles, namely robustness to noise of the local approach and discontinuity preservation of the global approach. This is particularly crucial in biological imaging, where the noise produced by microscopes is one of the main issues for optical flow estimation. The idea is to adapt spatially the local support of the local parametric constraint in the combined local-global model [6]. To this end, we jointly estimate the motion field and the parameters of the spatial support. We apply our approach to the case of Gaussian filtering, and we derive efficient minimization schemes for usual data terms. The estimation of a spatially varying standard deviation map prevents from the smoothing of motion discontinuities, while ensuring robustness to noise. We validate our method for a standard model and demonstrate how a baseline approach with pixel-wise data term can be improved when integrated in our framework. The method is evaluated on the Middlebury benchmark with ground truth and on real fluorescence microscopy data.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123657543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a new spatio-temporal attention mechanism for human action recognition that is able to automatically attend to the most important hands and to detect the most discriminative moments in an action. Attention is handled in a recurrent manner employing a Recurrent Neural Network (RNN) and is fully differentiable. In contrast to standard soft-attention mechanisms, our approach does not use the hidden RNN state as input to the attention model. Instead, attention distributions are drawn using external information: the articulated human pose. We performed an extensive ablation study to show the strengths of this approach, and we particularly studied the conditioning aspect of the attention mechanism. We evaluate the method on the largest currently available human action recognition dataset, NTU-RGB+D, and report state-of-the-art results. Another advantage of our model is a degree of explainability: the spatial and temporal attention distributions at test time allow us to study and verify on which parts of the input data the method focuses.
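The conditioning on pose rather than on the hidden state can be sketched in a few lines. Everything below is illustrative: the layer sizes are arbitrary, a single linear map stands in for the attention network, and random vectors stand in for the CNN hand features and the skeleton.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hands, feat_dim, pose_dim = 4, 256, 3 * 25       # e.g. up to 4 hand crops, 25 3D joints

hand_feats = rng.normal(size=(n_hands, feat_dim))  # CNN features of the hand crops
pose = rng.normal(size=pose_dim)                   # flattened articulated pose (external input)

W = rng.normal(scale=0.01, size=(n_hands, pose_dim))  # stand-in for the attention network
logits = W @ pose                                  # scores depend on pose, not on the RNN state
att = np.exp(logits - logits.max())
att /= att.sum()                                   # soft-attention distribution over hands

frame_descriptor = att @ hand_feats                # weighted sum passed on to the RNN
print("attention over hands:", np.round(att, 3))
```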
{"title":"Human Action Recognition: Pose-Based Attention Draws Focus to Hands","authors":"Fabien Baradel, Christian Wolf, J. Mille","doi":"10.1109/ICCVW.2017.77","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.77","url":null,"abstract":"We propose a new spatio-temporal attention based mechanism for human action recognition able to automatically attend to most important human hands and detect the most discriminative moments in an action. Attention is handled in a recurrent manner employing Recurrent Neural Network (RNN) and is fully-differentiable. In contrast to standard soft-attention based mechanisms, our approach does not use the hidden RNN state as input to the attention model. Instead, attention distributions are drawn using external information: human articulated pose. We performed an extensive ablation study to show the strengths of this approach and we particularly studied the conditioning aspect of the attention mechanism. We evaluate the method on the largest currently available human action recognition dataset, NTU-RGB+D, and report state-of-the-art results. Another advantage of our model are certains aspects of explanability, as the spatial and temporal attention distributions at test time allow to study and verify on which parts of the input data the method focuses.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129745182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, R. Horaud
Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.
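One way to see why such a model keeps tracking when a modality drops out is that each modality contributes its own likelihood term, which is simply omitted when the observation is missing. The sketch below is a bare-bones stand-in with Gaussian likelihoods and invented features and variances, not the paper's generative model.

```python
import numpy as np

def gauss_loglik(obs, mean, var):
    d = obs - mean
    return -0.5 * (d @ d) / var - 0.5 * len(obs) * np.log(2 * np.pi * var)

def speaker_loglik(speaker_pos, audio_obs=None, visual_obs=None):
    """Sum the log-likelihoods of whichever observations are present."""
    ll = 0.0
    if audio_obs is not None:                  # e.g. a 1D direction-of-arrival feature
        ll += gauss_loglik(audio_obs, speaker_pos[:1], var=0.2)
    if visual_obs is not None:                 # e.g. a detected face center in the image
        ll += gauss_loglik(visual_obs, speaker_pos, var=0.05)
    return ll

pos = np.array([0.3, 0.1])                     # candidate speaker location (normalized)
print(speaker_loglik(pos, audio_obs=np.array([0.25])))                    # audio only
print(speaker_loglik(pos, audio_obs=np.array([0.25]),
                     visual_obs=np.array([0.31, 0.12])))                  # both modalities
```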
{"title":"Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking","authors":"Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, R. Horaud","doi":"10.1109/ICCVW.2017.60","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.60","url":null,"abstract":"Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127732121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Trong Phuc Truong, M. Yamaguchi, Shohei Mori, Vincent Nozick, H. Saito
Thermal imaging has become a valuable tool for remote sensing in various fields and can provide relevant information for object recognition or classification. In this paper, we present an automated method to obtain a 3D model that fuses data from a visible and a thermal camera. The RGB and thermal point clouds are generated independently by structure from motion. The registration process includes a normalization of the point cloud scale, a global registration based on calibration data and the structure-from-motion output, and a fine registration employing a variant of the Iterative Closest Point (ICP) algorithm. Experimental results demonstrate the accuracy and robustness of the overall process.
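For intuition about the fine-registration step, the sketch below implements plain point-to-point ICP on top of a least-squares rigid fit. It is a generic illustration: the paper uses a variant of ICP whose details are not reproduced here, and the toy data and parameters are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def best_rigid(P, Q):
    """Least-squares rotation R and translation t aligning P to Q (Kabsch)."""
    cP, cQ = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cQ - R @ cP

def icp(source, target, iters=30):
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(iters):
        _, idx = tree.query(src)               # closest target point for each source point
        R, t = best_rigid(src, target[idx])
        src = src @ R.T + t
    return src

# Toy check: recover a small known rotation/translation between two clouds.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(300, 3))
angle = np.deg2rad(10)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
misaligned = cloud @ R_true.T + np.array([0.1, 0.0, 0.05])
aligned = icp(misaligned, cloud)
print("mean residual after ICP:", np.linalg.norm(aligned - cloud, axis=1).mean())
```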
{"title":"Registration of RGB and Thermal Point Clouds Generated by Structure From Motion","authors":"Trong Phuc Truong, M. Yamaguchi, Shohei Mori, Vincent Nozick, H. Saito","doi":"10.1109/ICCVW.2017.57","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.57","url":null,"abstract":"Thermal imaging has become a valuable tool in various fields for remote sensing and can provide relevant information to perform object recognition or classification. In this paper, we present an automated method to obtain a 3D model fusing data from a visible and a thermal camera. The RGB and thermal point clouds are generated independently by structure from motion. The registration process includes a normalization of the point cloud scale, a global registration based on calibration data and the output of the structure from motion, and a fine registration employing a variant of the Iterative Closest Point optimization. Experimental results demonstrate the accuracy and robustness of the overall process.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123671369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a Handcrafted Normalized-Convolution Network (NmzNet) for efficient texture classification. NmzNet is implemented as a three-layer normalized convolution network, which computes successive normalized convolutions with a predefined filter bank (a Gabor filter bank) followed by modulus non-linearities. Coefficients from different layers are aggregated by Fisher Vector aggregation to form the final discriminative features. Experimental evaluation on three texture datasets, UIUC, KTH-TIPS-2a, and KTH-TIPS-2b, indicates that the proposed approach achieves good classification rates compared with other handcrafted methods. The results additionally indicate that only a marginal difference exists between the best classification rate of recent CNN-based methods and that of the proposed method on the evaluated datasets.
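The first stage, convolution with a Gabor filter bank followed by a modulus non-linearity, can be sketched as below. This is a simplified illustration: the kernel parameters are arbitrary, simple mean/std pooling stands in for the Fisher Vector aggregation, and the normalization used by NmzNet is not reproduced.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, sigma=4.0, size=21):
    """Complex Gabor kernel with the given frequency and orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # coordinate along the orientation
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.exp(2j * np.pi * freq * xr)

image = np.random.rand(128, 128)                       # stand-in texture patch
bank = [gabor_kernel(f, t) for f in (0.1, 0.2)
                            for t in np.linspace(0, np.pi, 4, endpoint=False)]

responses = [np.abs(fftconvolve(image, k, mode="same")) for k in bank]   # modulus
features = np.concatenate([[r.mean(), r.std()] for r in responses])      # crude pooling
print(features.shape)   # per-image descriptor; Fisher Vector aggregation would replace this
```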
{"title":"A Handcrafted Normalized-Convolution Network for Texture Classification","authors":"Ngoc-Son Vu, Vu-Lam Nguyen, P. Gosselin","doi":"10.1109/ICCVW.2017.149","DOIUrl":"https://doi.org/10.1109/ICCVW.2017.149","url":null,"abstract":"In this paper, we propose a Handcrafted Normalized-Convolution Network (NmzNet) for efficient texture classification. NmzNet is implemented by a three-layer normalized convolution network, which computes successive normalized convolution with a predefined filter bank (Gabor filter bank) and modulus non-linearities. Coefficients from different layers are aggregated by Fisher Vector aggregation to form the final discriminative features. The results of experimental evaluation on three texture datasets UIUC, KTH-TIPS-2a, and KTH-TIPS-2b indicate that our proposed approach achieves the good classification rate compared with other handcrafted methods. The results additionally indicate that only a marginal difference exists between the best classification rate of recent frontiers CNN and that of the proposed method on the experimented datasets.","PeriodicalId":149766,"journal":{"name":"2017 IEEE International Conference on Computer Vision Workshops (ICCVW)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128927015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}