Romain Belmonte, Nacim Ihaddadene, Pierre Tirilly, Ioan Marius Bilasco, C. Djeraba
Face alignment remains difficult under uncontrolled conditions due to the many variations that can considerably impact facial appearance. Recently, video-based approaches have been proposed that take advantage of temporal coherence to improve robustness. However, these approaches suffer from limited temporal connectivity. We show that early, direct pixel connectivity enables the detection of local motion patterns and the learning of a hierarchy of motion features. We integrate local motion into the two predominant models in the literature, coordinate regression networks and heatmap regression networks, and combine it with late connectivity based on recurrent neural networks. Experimental results on two datasets, 300VW and SNaP-2DFe, show that local motion improves video-based face alignment and is complementary to late temporal information. Despite the simplicity of the proposed architectures, our best model achieves performance competitive with more complex models from the literature.
{"title":"Video-Based Face Alignment With Local Motion Modeling","authors":"Romain Belmonte, Nacim Ihaddadene, Pierre Tirilly, Ioan Marius Bilasco, C. Djeraba","doi":"10.1109/WACV.2019.00228","DOIUrl":"https://doi.org/10.1109/WACV.2019.00228","url":null,"abstract":"Face alignment remains difficult under uncontrolled conditions due to the many variations that may considerably impact facial appearance. Recently, video-based approaches have been proposed, which take advantage of temporal coherence to improve robustness. These new approaches suffer from limited temporal connectivity. We show that early, direct pixel connectivity enables the detection of local motion patterns and the learning of a hierarchy of motion features. We integrate local motion to the two predominant models in the literature, coordinate regression networks and heatmap regression networks, and combine it with late connectivity based on recurrent neural networks. The experimental results on two datasets, 300VW and SNaP-2DFe, show that local motion improves video-based face alignment and is complementary to late temporal information. Despite the simplicity of the proposed architectures, our best model provides competitive performance with more complex models from the literature.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124072576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep convolutional neural networks can be used to produce discriminative image-level features. However, when they are used as the feature extractor in a feature encoding pipeline, many design choices need to be made. In this work, we conduct a comprehensive study of deep convolutional feature encoding, paying special attention to its feature extraction aspect. We mainly evaluate the choice of encoding method, the choice of base DCNN model, and the choice of data augmentation method. We not only quantitatively confirm some known and previously unknown good choices for deep convolutional feature encoding, but also find that some supposedly good choices turn out to be bad. Based on the observations from these experiments, we present a very simple deep feature encoding pipeline and confirm its state-of-the-art performance on multiple image recognition datasets.
{"title":"Good Choices for Deep Convolutional Feature Encoding","authors":"Yu Wang, Jien Kato","doi":"10.1109/WACV.2019.00039","DOIUrl":"https://doi.org/10.1109/WACV.2019.00039","url":null,"abstract":"Deep convolutional neural networks can be used to produce discriminative image level features. However, when they are used as the feature extractor in a feature encoding pipeline, there are many design choices that are need to be made. In this work, we conduct a comprehensive study on deep convolutional feature encoding, by paying a special attention on its feature extraction aspect. We mainly evaluated the choices of the encoding methods; the choices of the base DCNN models; and the choices of the data augmentation methods. We not only quantitatively confirmed some known and previously unknown good choices for deep convolutional feature encoding, but also found out that some known good choices tune out to be bad. Base on the observations in the experiments, we present a very simple deep feature encoding pipeline, and confirmed its state-of-the-art performances on multiple image recognition datasets.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124282005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D skeletal data has recently attracted wide attention in human behavior analysis for its robustness to scene variations, yet accurate gesture recognition remains challenging. The main reason lies in the high intra-class variance caused by temporal dynamics. One solution is to resort to generative models such as the hidden Markov model (HMM). However, existing methods commonly assume fixed anchors for each hidden state, which makes it hard to capture the explicit temporal structure of gestures. Based on the observation that a gesture is a time series with distinctly defined phases, we propose a new formulation that builds temporal compositions of gestures through low-rank matrix decomposition. The only assumption is that the static poses of a gesture's "hold" phases are linearly correlated with each other. As such, a gesture sequence can be segmented into temporal states with semantically meaningful and discriminative concepts. Furthermore, unlike traditional HMMs, which tend to use a specific distance metric for clustering and ignore temporal contextual information when estimating the emission probability, we use a Long Short-Term Memory (LSTM) network to learn probability distributions over the states of the HMM. The proposed method is validated on two challenging datasets. Experiments demonstrate that our approach works effectively on a wide range of gestures and actions and achieves state-of-the-art performance.
{"title":"Hidden States Exploration for 3D Skeleton-Based Gesture Recognition","authors":"Xin Liu, Henglin Shi, Xiaopeng Hong, Haoyu Chen, D. Tao, Guoying Zhao","doi":"10.1109/WACV.2019.00201","DOIUrl":"https://doi.org/10.1109/WACV.2019.00201","url":null,"abstract":"3D skeletal data has recently attracted wide attention in human behavior analysis for its robustness to variant scenes, while accurate gesture recognition is still challenging. The main reason lies in the high intra-class variance caused by temporal dynamics. A solution is resorting to the generative models, such as the hidden Markov model (HMM). However, existing methods commonly assume fixed anchors for each hidden state, which is hard to depict the explicit temporal structure of gestures. Based on the observation that a gesture is a time series with distinctly defined phases, we propose a new formulation to build temporal compositions of gestures by the low-rank matrix decomposition. The only assumption is that the gesture's \"hold\" phases with static poses are linearly correlated among each other. As such, a gesture sequence could be segmented into temporal states with semantically meaningful and discriminative concepts. Furthermore, different to traditional HMMs which tend to use specific distance metric for clustering and ignore the temporal contextual information when estimating the emission probability, the Long Short-Term Memory (LSTM) is utilized to learn probability distributions over states of HMM. The proposed method is validated on two challenging datasets. Experiments demonstrate that our approach can effectively work on a wide range of gestures and actions, and achieve state-of-the-art performance.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"287 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122002572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We address the problem of learning dynamic patterns from unlabeled video sequences, either in the form of generating new video sequences or recovering incomplete ones. This problem is challenging because the appearances and motions in video sequences can be very complex. We propose to use the alternating back-propagation algorithm to learn a generator network with a spatio-temporal convolutional architecture. The proposed method is efficient and flexible: it can not only generate realistic video sequences, but also recover incomplete video sequences in the testing stage or even in the learning stage. The algorithm can be further improved by using a learned initialization, which is useful for recovery tasks. Furthermore, it naturally helps to learn a shared representation between different modalities. Our experiments show that our method is competitive with existing state-of-the-art methods both qualitatively and quantitatively.
{"title":"Learning Generator Networks for Dynamic Patterns","authors":"Tian Han, Yang Lu, Jiawen Wu, X. Xing, Y. Wu","doi":"10.1109/WACV.2019.00091","DOIUrl":"https://doi.org/10.1109/WACV.2019.00091","url":null,"abstract":"We address the problem of learning dynamic patterns from unlabeled video sequences, either in the form of generating new video sequences, or recovering incomplete video sequences. This problem is challenging because the appearances and motions in the video sequences can be very complex. We propose to use the alternating back-propagation algorithm to learn the generator network with the spatial-temporal convolutional architecture. The proposed method is efficient and flexible. It can not only generate realistic video sequences, but can also recover the incomplete video sequences in the testing stage or even in the learning stage. The proposed algorithm can be further improved by using learned initialization which is useful for the recovery tasks. Further, the proposed algorithm can naturally help to learn the shared representation between different modalities. Our experiments show that our method is competitive with the existing state of the art methods both qualitatively and quantitatively.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"168 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115253819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tejaswi Kasarla, G. Nagendar, Guruprasad M. Hegde, V. Balasubramanian, C. V. Jawahar
As vision-based autonomous systems, such as self-driving vehicles, become a reality, there is an increasing need for large annotated datasets for developing solutions to vision tasks. One important task that has seen significant interest in recent years is semantic segmentation. However, the cost of annotating every pixel for semantic segmentation is immense and can be prohibitive when scaling to various settings and locations. In this paper, we propose a region-based active learning method for efficient labeling in semantic segmentation. Using the proposed active learning strategy, we show that we can judiciously select the regions for annotation such that we obtain 93.8% of the baseline performance (when all pixels are labeled) while labeling only 10% of the total number of pixels. Further, we show that this approach can be used to transfer annotations from a model trained on one dataset (Cityscapes) to a different dataset (Mapillary), highlighting its promise and potential.
{"title":"Region-based active learning for efficient labeling in semantic segmentation","authors":"Tejaswi Kasarla, G. Nagendar, Guruprasad M. Hegde, V. Balasubramanian, C. V. Jawahar","doi":"10.1109/WACV.2019.00123","DOIUrl":"https://doi.org/10.1109/WACV.2019.00123","url":null,"abstract":"As vision-based autonomous systems, such as self-driving vehicles, become a reality, there is an increasing need for large annotated datasets for developing solutions to vision tasks. One important task that has seen significant interest in recent years is semantic segmentation. However, the cost of annotating every pixel for semantic segmentation is immense, and can be prohibitive in scaling to various settings and locations. In this paper, we propose a region-based active learning method for efficient labeling in semantic segmentation. Using the proposed active learning strategy, we show that we are able to judiciously select the regions for annotation such that we obtain 93.8% of the baseline performance (when all pixels are labeled) with labeling of 10% of the total number of pixels. Further, we show that this approach can be used to transfer annotations from a model trained on a given dataset (Cityscapes) to a different dataset (Mapillary), thus highlighting its promise and potential.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124914917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To automatically produce a brief yet expressive summary of a long video, an automatic algorithm should start by resembling the human process of summary generation. Prior work proposed supervised and unsupervised algorithms that learn the underlying behavior of humans by increasing modeling complexity or by crafting better heuristics to simulate the human summary generation process. In this work, we take a different approach and analyze a major cue that humans exploit for summary generation: the nature and intensity of actions. We empirically observe that a frame is more likely to be included in human-generated summaries if it contains a substantial amount of deliberate motion performed by an agent, which is referred to as actionness. We therefore hypothesize that learning to automatically generate summaries involves an implicit knowledge of actionness estimation and ranking. We validate this hypothesis by running a user study that explores the correlation between human-generated summaries and actionness ranks. We also run a consensus and behavioral analysis between human subjects to ensure reliable and consistent results. The analysis exhibits a considerable degree of agreement among subjects in the obtained data, verifying our initial hypothesis. Based on these findings, we develop a method that incorporates actionness data to explicitly regulate a learning algorithm trained for summary generation. We assess the performance of our approach on four summarization benchmark datasets and demonstrate a clear advantage over state-of-the-art summarization methods.
{"title":"Video Summarization Via Actionness Ranking","authors":"Mohamed Elfeki, A. Borji","doi":"10.1109/WACV.2019.00085","DOIUrl":"https://doi.org/10.1109/WACV.2019.00085","url":null,"abstract":"To automatically produce a brief yet expressive summary of a long video, an automatic algorithm should start by resembling the human process of summary generation. Prior work proposed supervised and unsupervised algorithms to train models for learning the underlying behavior of humans by increasing modeling complexity or craft-designing better heuristics to simulate human summary generation process. In this work, we take a different approach by analyzing a major cue that humans exploit for summary generation; the nature and intensity of actions. We empirically observed that a frame is more likely to be included in human-generated summaries if it contains a substantial amount of deliberate motion performed by an agent, which is referred to as actionness. Therefore, we hypothesize that learning to automatically generate summaries involves an implicit knowledge of actionness estimation and ranking. We validate our hypothesis by running a user study that explores the correlation between human-generated summaries and actionness ranks. We also run a consensus and behavioral analysis between human subjects to ensure reliable and consistent results. The analysis exhibits a considerable degree of agreement among subjects within obtained data and verifying our initial hypothesis. Based on the study findings, we develop a method to incorporate actionness data to explicitly regulate a learning algorithm that is trained for summary generation. We assess the performance of our approach on 4 summarization benchmark datasets, and demonstrate an evident advantage compared to state-of-the-art summarization methods.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123363759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For convolutional neural network models that optimize an image embedding, we propose a method to highlight the regions of images that contribute most to pairwise similarity. This work is a corollary to the visualization tools developed for classification networks, but is applicable to problem domains better suited to similarity learning. The visualization shows how fine-tuned similarity networks learn to focus on different features. We also generalize our approach to embedding networks that use different pooling strategies and provide a simple mechanism to support image similarity searches on objects or sub-regions in the query image.
{"title":"Visualizing Deep Similarity Networks","authors":"Abby Stylianou, Richard Souvenir, Robert Pless","doi":"10.1109/WACV.2019.00220","DOIUrl":"https://doi.org/10.1109/WACV.2019.00220","url":null,"abstract":"For convolutional neural network models that optimize an image embedding, we propose a method to highlight the regions of images that contribute most to pairwise similarity. This work is a corollary to the visualization tools developed for classification networks, but applicable to the problem domains better suited to similarity learning. The visualization shows how similarity networks that are fine-tuned learn to focus on different features. We also generalize our approach to embedding networks that use different pooling strategies and provide a simple mechanism to support image similarity searches on objects or sub-regions in the query image.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129669918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose a method for separating the direct and global components of a dynamic scene per illumination color using a projector-camera system; it exploits both the color switch and the temporal dithering of a DLP projector. Our method is easy to implement because it requires neither self-built equipment nor temporal synchronization between the projector and the camera. In addition, it automatically calibrates the projector-camera correspondence in a dynamic scene on the basis of the consistency of pixel intensities, and optimizes the projection pattern on the basis of a noise propagation analysis. We implemented a prototype setup and achieved multispectral direct-global separation of dynamic scenes at 60 Hz. Furthermore, we demonstrate that our method is effective for applications such as image-based material editing and multispectral relighting of dynamic scenes in which wavelength-dependent phenomena such as fluorescence are observed.
{"title":"Multispectral Direct-Global Separation of Dynamic Scenes","authors":"M. Torii, Takahiro Okabe, Toshiyuki Amano","doi":"10.1109/WACV.2019.00209","DOIUrl":"https://doi.org/10.1109/WACV.2019.00209","url":null,"abstract":"In this paper, we propose a method for separating direct and global components of a dynamic scene per illumination color by using a projector-camera system; it exploits both the color switch and the temporal dithering of a DLP projector. Our proposed method is easy-to-implement because it does not require any self-built equipment and temporal synchronization between a projector and a camera. In addition, our method automatically calibrates the projector-camera correspondence in a dynamic scene on the basis of the consistency in pixel intensities, and optimizes the projection pattern on the basis of noise propagation analysis. We implemented the prototype setup and achieved multispectral direct-global separation of dynamic scenes in 60 Hz. Furthermore, we demonstrated that our method is effective for applications such as image-based material editing and multispectral relighting of dynamic scenes where wavelength-dependent phenomena such as fluorescence are observed.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125245300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the first benchmark dataset for slide-page segmentation. Presentation slides are one of the most prominent document types used to exchange ideas across the web, educational institutes, and businesses. This document format is marked by a complex layout that contains a rich variety of graphical (e.g., diagram, logo), textual (e.g., heading, affiliation), and structural components (e.g., enumeration, legend). This vast and popular knowledge source is still unattainable by modern machine learning techniques due to the lack of annotated data. To tackle this issue, we introduce SPaSe (Slide Page Segmentation), a novel dataset containing in total 2000 slides with dense, pixel-wise annotations of 25 classes. We show that slide segmentation reveals some interesting properties that characterize this task. Unlike the common image segmentation problem, disjoint classes tend to have a high overlap of regions, thus posing this segmentation task as a multi-label problem. Furthermore, many of the frequently encountered classes in slides are location sensitive (e.g., title, footnote). Hence, we believe our dataset represents a challenging and interesting benchmark for novel segmentation models. Finally, we evaluate state-of-the-art deep segmentation models on our dataset and show that it is suitable for developing deep learning models without any need for pre-training. The dataset will be released to the public to foster further research on this interesting task.
{"title":"SPaSe - Multi-Label Page Segmentation for Presentation Slides","authors":"Monica Haurilet, Ziad Al-Halah, R. Stiefelhagen","doi":"10.1109/WACV.2019.00082","DOIUrl":"https://doi.org/10.1109/WACV.2019.00082","url":null,"abstract":"We introduce the first benchmark dataset for slide-page segmentation. Presentation slides are one of the most prominent document types used to exchange ideas across the web, educational institutes and businesses. This document format is marked with a complex layout which contains a rich variety of graphical (e.g. diagram, logo), textual (e.g. heading, affiliation) and structural components (e.g. enumeration, legend). This vast and popular knowledge source is still unattainable by modern machine learning technique due to lack of annotated data. To tackle this issue, we introduce SPaSe (Slide Page Segmentation), a novel dataset containing in total 2000 slides with dense, pixel-wise annotations of 25 classes. We show that slide segmentation reveals some interesting properties that characterize this task. Unlike the common image segmentation problem, disjoint classes tend to have a high overlap of regions, thus posing this segmentation task as a multi-label problem. Furthermore, many of the frequently encountered classes in slides are location sensitive (e.g. title, footnote). Hence, we believe our dataset represents a challenging and interesting benchmark for novel segmentation models. Finally, we evaluate state-of-the-art deep segmentation models on our dataset and show that it is suitable for developing deep learning models without any need of pre-training. Our dataset will be released to the public to foster further research on this interesting task.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"20 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120848388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we tackle no-reference image quality assessment (NR-IQA), which aims to predict the perceptual quality of a test image without referencing its pristine-quality counterpart. The free-energy brain theory implies that the human visual system (HVS) tends to predict the pristine image while perceiving a distorted one. Moreover, image quality assessment heavily depends on how human beings attend to distorted images. Motivated by this, we first restore the distorted image. Then, given the distorted-restored pair, we make the first attempt to formulate NR-IQA as a dynamic attentional process and implement it via reinforcement learning. The reward is derived from two tasks: classifying the distortion type and predicting the perceptual score of the test image. The model learns a policy to sample a sequence of fixation areas with the goal of maximizing the expected accumulated reward. The observations of the fixation areas are aggregated through a recurrent neural network (RNN) and a robust averaging strategy that assigns different weights to different fixation areas. Extensive experiments on TID2008, TID2013, and CSIQ demonstrate the superiority of our method.
{"title":"No-Reference Image Quality Assessment: An Attention Driven Approach","authors":"Diqi Chen, Yizhou Wang, Hongyu Ren, Wen Gao","doi":"10.1109/WACV.2019.00046","DOIUrl":"https://doi.org/10.1109/WACV.2019.00046","url":null,"abstract":"In this paper, we tackle no-reference image quality assessment (NR-IQA), which aims to predict the perceptual quality of a test image without referencing its pristine-quality counterpart. The free-energy brain theory implies that the human visual system (HVS) tends to predict the pristine image while perceiving a distorted one. Besides, image quality assessment heavily depends on the way how human beings attend to distorted images. Motivated by that, the distorted image is restored first. Then given the distorted-restored pair, we make the first attempt to formulate the NR-IQA as a dynamic attentional process and implement it via reinforcement learning. The reward is derived from two tasks—classifying the distortion type and predicting the perceptual score of a test image. The model learns a policy to sample a sequence of fixation areas with a goal to maximize the expectation of the accumulated rewards. The observations of the fixation areas are aggregated through a recurrent neural network (RNN) and the robust averaging strategy which assigns different weights on different fixation areas. Extensive experiments on TID2008, TID2013 and CSIQ demonstrate the superiority of our method.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127236877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}