Pedestrian path forecasting is crucial in applications such as smart video surveillance. It is a challenging task because of the complex crowd movement patterns in the scenes. Most existing state-of-the-art LSTM-based prediction methods require rich context such as labelled static obstacles, labelled entrance/exit regions, and even the background scene. Furthermore, incorporating contextual information into trajectory prediction increases the computational overhead and reduces the generalization of the prediction models across different scenes. In this paper, we propose a joint Location-Velocity Attention LSTM-based method to predict trajectories. Specifically, a module is designed to tweak the LSTM network, and an attention mechanism is trained to learn to optimally combine the location and the velocity information of pedestrians in the prediction process. We have evaluated our approach against baselines and state-of-the-art methods on several publicly available datasets. The results show that it not only outperforms other prediction methods but also generalizes well.
{"title":"Location-Velocity Attention for Pedestrian Trajectory Prediction","authors":"Hao Xue, D. Huynh, Mark Reynolds","doi":"10.1109/WACV.2019.00221","DOIUrl":"https://doi.org/10.1109/WACV.2019.00221","url":null,"abstract":"Pedestrian path forecasting is crucial in applications such as smart video surveillance. It is a challenging task because of the complex crowd movement patterns in the scenes. Most of existing state-of-the-art LSTM based prediction methods require rich context like labelled static obstacles, labelled entrance/exit regions and even the background scene. Furthermore, incorporating contextual information into trajectory prediction increases the computational overhead and decreases the generalization of the prediction models across different scenes. In this paper, we propose a joint Location-Velocity Attention LSTM based method to predict trajectories. Specifically, a module is designed to tweak the LSTM network and an attention mechanism is trained to learn to optimally combine the location and the velocity information of pedestrians in the prediction process. We have evaluated our approach against other baselines and state-of-the-art methods on several publicly available datasets. The results show that it not only outperforms other prediction methods but it also has a good generalization ability.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121965156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Huynh, J. Pillai, Eunyoung Kim, Kristen Aw, Jack Sim, Ken Goldman, Rui Min
While deep learning has achieved great success in building vision applications for mainstream users, there is relatively little work on giving the blind and visually impaired a personal, on-device visual assistant for their daily life. Unlike mainstream applications, a vision system for the blind must be robust, reliable and safe to use. In this paper, we propose a fine-grained currency recognizer based on CONGAS, which surpasses other popular local features by a large margin. In addition, we introduce an effective and lightweight coarse classifier that gates the fine-grained recognizer on resource-constrained mobile devices. The coarse-to-fine approach is orchestrated into an extensible mobile-vision architecture that demonstrates how coordinating deep learning and local-feature-based methods can help resolve a challenging problem for the blind and visually impaired. The proposed system runs in real time with ~150 ms latency on a Pixel device, and achieves 98% precision and 97% recall on a challenging evaluation set.
{"title":"Bringing Vision to the Blind: From Coarse to Fine, One Dollar at a Time","authors":"T. Huynh, J. Pillai, Eunyoung Kim, Kristen Aw, Jack Sim, Ken Goldman, Rui Min","doi":"10.1109/WACV.2019.00057","DOIUrl":"https://doi.org/10.1109/WACV.2019.00057","url":null,"abstract":"While deep learning has achieved great success in building vision applications for mainstream users, there is relatively less work for the blind and visually impaired to have a personal, on-device visual assistant for their daily life. Unlike mainstream applications, vision system for the blind must be robust, reliable and safe-to-use. In this paper, we propose a fine-grained currency recognizer based on CONGAS, which significantly surpasses other popular local features by a large margin. In addition, we introduce an effective and light-weight coarse classifier that gates the fine-grained recognizer on resource-constrained mobile devices. The coarse-to-fine approach is orchestrated to provide an extensible mobile-vision architecture, that demonstrates how the benefits of coordinating deep learning and local feature based methods can help in resolving a challenging problem for the blind and visually impaired. The proposed system runs in real-time with ~150ms latency on a Pixel device, and achieved 98% precision and 97% recall on a challenging evaluation set.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129139329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper compares the performance of three state-of-the-art visual-inertial simultaneous localization and mapping (SLAM) methods in the context of assisted wayfinding of the visually impaired. Specifically, we analyze their strengths and weaknesses for assisted wayfinding with a robotic navigation aid (RNA). Based on the analysis, we select the best visual-inertial SLAM method for the RNA application and extend it with a method capable of detecting loops caused by the RNA's unique motion pattern. By incorporating the detected loop closures into the graph optimization process, the extended visual-inertial SLAM method reduces the pose estimation error. Experimental results on our own datasets and the TUM VI benchmark datasets confirm the advantage of the selected method over the other two and validate the efficacy of the extended method.
{"title":"A Comparative Analysis of Visual-Inertial SLAM for Assisted Wayfinding of the Visually Impaired","authors":"He Zhang, Lingqiu Jin, H. Zhang, C. Ye","doi":"10.1109/WACV.2019.00028","DOIUrl":"https://doi.org/10.1109/WACV.2019.00028","url":null,"abstract":"This paper compares the performance of three state-of-the-art visual-inertial simultaneous localization and mapping (SLAM) methods in the context of assisted wayfinding of the visually impaired. Specifically, we analyze their strengths and weaknesses for assisted wayfinding of a robotic navigation aid (RNA). Based on the analysis, we select the best visual-inertial SLAM method for the RNA application and extend the method by integrating with it a method capable of detecting loops caused by the RNA's unique motion pattern. By incorporating the loop closures in the graph and optimization process, the extended visual-inertial SLAM method reduces the pose estimation error. The experimental results with our own datasets and the TUM VI benchmark datasets confirm the advantage of the selected method over the other two and validate the efficacy of the extended method.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"2007 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127486251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel P. Benalcazar, C. Pérez, Diego Bastias, K. Bowyer
In most iris recognition systems, the texture of the iris image is the result of either the interaction between the iris and near-infrared (NIR) light, or between the iris pigmentation and visible light. The iris, however, is a three-dimensional organ, and the information contained in its relief is not fully exploited. In this article, we present an image acquisition method that enhances the visibility of the structural information of the iris. Our method adds lateral illumination to the frontal visible-light illumination so that the resulting image captures the structure of the muscle fibers of the iris. The resulting images contain highly textured iris patterns. To test our method, we collected a database of 1,920 iris images using both a conventional NIR device and a custom-made device that illuminates the eye at lateral and frontal angles with visible light (LFVL). We then compared the iris recognition performance of the two devices by analyzing the Hamming distance distributions of the corresponding binary iris codes. The ROC curves show that our method produced more separable distributions than the NIR device, and much better distributions than frontal visible light alone. After eliminating errors produced by images captured with different iris dilation (13 cases), the NIR device produced inter-class and intra-class distributions that are completely separable, as in the case of LFVL. This acquisition method could also be useful for 3D iris scanning.
{"title":"Iris Recognition: Comparing Visible-Light Lateral and Frontal Illumination to NIR Frontal Illumination","authors":"Daniel P. Benalcazar, C. Pérez, Diego Bastias, K. Bowyer","doi":"10.1109/WACV.2019.00097","DOIUrl":"https://doi.org/10.1109/WACV.2019.00097","url":null,"abstract":"In most iris recognition systems the texture of the iris image is either the result of the interaction between the iris and Near Infrared (NIR) light, or between the iris pigmentation and visible-light. The iris, however, is a three-dimensional organ, and the information contained on its relief is not being exploited completely. In this article, we present an image acquisition method that enhances viewing the structural information of the iris. Our method consists of adding lateral illumination to the visible light frontal illumination to capture the structural information of the muscle fibers of the iris on the resulting image. These resulting images contain highly textured patterns of the iris. To test our method, we collected a database of 1,920 iris images using both a conventional NIR device, and a custom-made device that illuminates the eye in lateral and frontal angles with visible-light (LFVL). Then, we compared the iris recognition performance of both devices by means of a Hamming distance distribution analysis among the corresponding binary iris codes. The ROC curves show that our method produced more separable distributions than those of the NIR device, and much better distribution than using frontal visible-light alone. Eliminating errors produced by images captured with different iris dilation (13 cases), the NIR produced inter-class and intra-class distributions that are completely separable as in the case of LFVL. This acquisition method could also be useful for 3D iris scanning.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"191 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129237288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mohammad Mahfujur Rahman, C. Fookes, Mahsa Baktash, S. Sridharan
Domain adaptation (DA) and domain generalization (DG) are two closely related approaches that are both concerned with the task of assigning labels to an unlabeled data set. The key difference between them is that DA can access the target data during the training phase, while in DG the target data is entirely unseen during training. The DG task is challenging because we have no prior knowledge of the target samples. If DA methods are applied directly to DG by simply excluding the target data from training, they perform poorly. In this paper, we tackle the domain generalization challenge in two ways. In our first approach, we propose a novel deep domain generalization architecture that utilizes synthetic data generated by a Generative Adversarial Network (GAN). The discrepancy between the source images and the generated synthetic images is minimized using existing domain discrepancy metrics such as maximum mean discrepancy or correlation alignment. In our second approach, we introduce a protocol for applying DA methods to a DG scenario by excluding the target data from the training phase, splitting the source data into training and validation parts, and treating the validation data as target data for DA. We conduct extensive experiments on four cross-domain benchmark datasets. Experimental results show that our proposed model outperforms the current state-of-the-art methods for DG.
{"title":"Multi-Component Image Translation for Deep Domain Generalization","authors":"Mohammad Mahfujur Rahman, C. Fookes, Mahsa Baktash, S. Sridharan","doi":"10.1109/WACV.2019.00067","DOIUrl":"https://doi.org/10.1109/WACV.2019.00067","url":null,"abstract":"Domain adaption (DA) and domain generalization (DG) are two closely related methods which are both concerned with the task of assigning labels to an unlabeled data set. The only dissimilarity between these approaches is that DA can access the target data during the training phase, while the target data is totally unseen during the training phase in DG. The task of DG is challenging as we have no earlier knowledge of the target samples. If DA methods are applied directly to DG by a simple exclusion of the target data from training, poor performance will result for a given task. In this paper, we tackle the domain generalization challenge in two ways. In our first approach, we propose a novel deep domain generalization architecture utilizing synthetic data generated by a Generative Adversarial Network (GAN). The discrepancy between the generated images and synthetic images is minimized using existing domain discrepancy metrics such as maximum mean discrepancy or correlation alignment. In our second approach, we introduce a protocol for applying DA methods to a DG scenario by excluding the target data from the training phase, splitting the source data to training and validation parts, and treating the validation data as target data for DA. We conduct extensive experiments on four cross-domain benchmark datasets. Experimental results signify our proposed model outperforms the current state-of-the-art methods for DG.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116613401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xin Li, Shuai Zhang, Bolan Jiang, Y. Qi, M. Chuah, N. Bi
Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy: a complex deep learning model with high accuracy runs slowly on resource-limited devices, while a lightweight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with far fewer parameters. DAC computes the weights of the newly generated layers directly from the weights of the original convolutional layer, so no training (or fine-tuning) and no data are needed. The experimental results show that DAC greatly reduces the number of floating-point operations (FLOPs) while maintaining the high accuracy of a pre-trained model. If a 2% accuracy drop is acceptable, DAC saves 53% of the FLOPs of the VGG16 image classification model on the ImageNet dataset, 29% of the FLOPs of the SSD300 object detection model on the PASCAL VOC2007 dataset, and 46% of the FLOPs of a multi-person pose estimation model on the Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance.
{"title":"DAC: Data-Free Automatic Acceleration of Convolutional Networks","authors":"Xin Li, Shuai Zhang, Bolan Jiang, Y. Qi, M. Chuah, N. Bi","doi":"10.1109/WACV.2019.00175","DOIUrl":"https://doi.org/10.1109/WACV.2019.00175","url":null,"abstract":"Deploying a deep learning model on mobile/IoT devices is a challenging task. The difficulty lies in the trade-off between computation speed and accuracy. A complex deep learning model with high accuracy runs slowly on resource-limited devices, while a light-weight model that runs much faster loses accuracy. In this paper, we propose a novel decomposition method, namely DAC, that is capable of factorizing an ordinary convolutional layer into two layers with much fewer parameters. DAC computes the corresponding weights for the newly generated layers directly from the weights of the original convolutional layer. Thus, no training (or fine-tuning) or any data is needed. The experimental results show that DAC reduces a large number of floating-point operations (FLOPs) while maintaining high accuracy of a pre-trained model. If 2% accuracy drop is acceptable, DAC saves 53% FLOPs of VGG16 image classification model on ImageNet dataset, 29% FLOPS of SSD300 object detection model on PASCAL VOC2007 dataset, and 46% FLOPS of a multi-person pose estimation model on Microsoft COCO dataset. Compared to other existing decomposition methods, DAC achieves better performance.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"68 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128725176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most geometric approaches to monocular Visual Odometry (VO) provide robust pose estimates, but only sparse or semi-dense depth estimates. Recently, deep methods have shown good performance in generating dense depth and VO from monocular images by optimizing the photometric consistency between images. Despite being intuitive, a naive photometric loss does not ensure proper pixel correspondences between two views, which is the key factor for accurate depth and relative pose estimation. It is well known that simply minimizing such an error is prone to failure. We propose a method that uses epipolar constraints to make the learning more geometrically sound. We use the Essential matrix, obtained with Nistér's Five Point Algorithm, to enforce meaningful geometric constraints on the loss, rather than using it as a label for training. Although simple, our method is more geometrically meaningful and uses fewer parameters to achieve performance comparable to state-of-the-art methods that rely on complex losses and large networks, showing the effectiveness of epipolar constraints. Such a geometrically constrained learning method succeeds even in cases where simply minimizing the photometric error would fail.
{"title":"SfMLearner++: Learning Monocular Depth & Ego-Motion Using Meaningful Geometric Constraints","authors":"V. Prasad, B. Bhowmick","doi":"10.1109/WACV.2019.00226","DOIUrl":"https://doi.org/10.1109/WACV.2019.00226","url":null,"abstract":"Most geometric approaches to monocular Visual Odometry (VO) provide robust pose estimates, but sparse or semi-dense depth estimates. Off late, deep methods have shown good performance in generating dense depths and VO from monocular images by optimizing the photometric consistency between images. Despite being intuitive, a naive photometric loss does not ensure proper pixel correspondences between two views, which is the key factor for accurate depth and relative pose estimations. It is a well known fact that simply minimizing such an error is prone to failures. We propose a method using Epipolar constraints to make the learning more geometrically sound. We use the Essential matrix, obtained using Nistér's Five Point Algorithm, for enforcing meaningful geometric constraints on the loss, rather than using it as labels for training. Our method, although simplistic but more geometrically meaningful, uses lesser number of parameters to give a comparable performance to state-of-the-art methods which use complex losses and large networks showing the effectiveness of using epipolar constraints. Such a geometrically constrained learning method performs successfully even in cases where simply minimizing the photometric error would fail.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123576278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yilun Chen, Praveen Palanisamy, P. Mudalige, Katharina Muelling, J. Dolan
A safe and robust on-road navigation system is a crucial component of achieving fully automated vehicles. NVIDIA recently proposed an end-to-end algorithm that can directly learn steering commands from the raw pixels of a front camera using a single convolutional neural network. In this paper, we leverage auxiliary information in addition to raw images and design a novel network structure, called Auxiliary Task Network (ATN), to boost driving performance while retaining the advantages of minimal training data and end-to-end training. In this network, we introduce human prior knowledge into vehicle navigation by transferring features from image recognition tasks, and we apply image semantic segmentation as an auxiliary task for navigation. We capture temporal information by adding an LSTM module and optical flow to the network. Finally, we combine vehicle kinematics with a sensor fusion step. We discuss the benefits of our method over state-of-the-art visual navigation methods both in the Udacity simulation environment and on the real-world Comma.ai dataset.
{"title":"Learning On-Road Visual Control for Self-Driving Vehicles With Auxiliary Tasks","authors":"Yilun Chen, Praveen Palanisamy, P. Mudalige, Katharina Muelling, J. Dolan","doi":"10.1109/WACV.2019.00041","DOIUrl":"https://doi.org/10.1109/WACV.2019.00041","url":null,"abstract":"A safe and robust on-road navigation system is a crucial component of achieving fully automated vehicles. NVIDIA recently proposed an End-to-End algorithm that can directly learn steering commands from raw pixels of a front camera by using one convolutional neural network. In this paper, we leverage auxiliary information aside from raw images and design a novel network structure, called Auxiliary Task Network (ATN), to help boost the driving performance while maintaining the advantage of minimal training data and an End-to-End training method. In this network, we introduce human prior knowledge into vehicle navigation by transferring features from image recognition tasks. Image semantic segmentation is applied as an auxiliary task for navigation. We consider temporal information by introducing an LSTM module and optical flow to the network. Finally, we combine vehicle kinematics with a sensor fusion step. We discuss the benefits of our method over state-of-the-art visual navigation methods both in the Udacity simulation environment and on the real-world Comma.ai dataset.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124941668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaolong Jiang, Peizhao Li, Xiantong Zhen, Xianbin Cao
Because it can track an arbitrary object, a model-free tracker is broadly applicable regardless of the target type. However, designing such a generalized framework is challenging due to the lack of object-specific prior information. As one solution, a real-time model-free object tracking approach relying on Convolutional Neural Networks (CNNs) is designed in this work. To overcome the scarcity of object-centric information, appearance and motion features are deeply integrated by the proposed AMNet, an end-to-end offline-trained two-stream network. Of the two parallel streams, the ANet extracts appearance features with a multi-scale Siamese atrous CNN, enabling a tracking-by-matching strategy, while the MNet performs deep motion detection to localize moving objects from generic motion features. The final tracking result at each frame is generated by fusing the output response maps of the two sub-networks. The proposed AMNet reports leading performance on both the OTB and VOT benchmark datasets with favorable real-time processing speed.
{"title":"Model-Free Tracking With Deep Appearance and Motion Features Integration","authors":"Xiaolong Jiang, Peizhao Li, Xiantong Zhen, Xianbin Cao","doi":"10.1109/WACV.2019.00018","DOIUrl":"https://doi.org/10.1109/WACV.2019.00018","url":null,"abstract":"Being able to track an anonymous object, a model-free tracker is comprehensively applicable regardless of the target type. However, designing such a generalized framework is challenged by the lack of object-oriented prior information. As one solution, a real-time model-free object tracking approach is designed in this work relying on Convolutional Neural Networks (CNNs). To overcome the object-centric information scarcity, both appearance and motion features are deeply integrated by the proposed AMNet, which is an end-to-end offline trained two-stream network. Between the two parallel streams, the ANet investigates appearance features with a multi-scale Siamese atrous CNN, enabling the tracking-by-matching strategy. The MNet achieves deep motion detection to localize anonymous moving objects by processing generic motion features. The final tracking result at each frame is generated by fusing the output response maps from both sub-networks. The proposed AMNet reports leading performance on both OTB and VOT benchmark datasets with favorable real-time processing speed.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"389 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124801161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can learning to measure the quality of one action help in measuring the quality of other actions? If so, can consolidated samples from multiple actions help improve the performance of current approaches? In this paper, we carry out experiments to see if knowledge transfer is possible in the action quality assessment (AQA) setting. Experiments are carried out on our newly released AQA dataset (http://rtis.oit.unlv.edu/datasets.html), consisting of 1,106 action samples from seven actions with quality scores assigned by expert human judges. Our experimental results show that there is utility in learning a single model across multiple actions.
{"title":"Action Quality Assessment Across Multiple Actions","authors":"Paritosh Parmar, B. Morris","doi":"10.1109/WACV.2019.00161","DOIUrl":"https://doi.org/10.1109/WACV.2019.00161","url":null,"abstract":"Can learning to measure the quality of an action help in measuring the quality of other actions? If so, can consolidated samples from multiple actions help improve the performance of current approaches? In this paper, we carry out experiments to see if knowledge transfer is possible in the action quality assessment (AQA) setting. Experiments are carried out on our newly released AQA dataset (http://rtis.oit.unlv.edu/datasets.html) consisting of 1106 action samples from seven actions with quality as measured by expert human judges. Our experimental results show that there is utility in learning a single model across multiple actions.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115698192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}