Tiantong Guo, Hojjat Seyed Mousavi, T. Vu, V. Monga
Recent advances have seen a surge of deep learning approaches for image super-resolution. Invariably, a network, e.g., a deep convolutional neural network (CNN) or auto-encoder, is trained to learn the relationship between low- and high-resolution image patches. Recognizing that a wavelet transform provides a "coarse" as well as a "detail" separation of image content, we design a deep CNN to predict the "missing details" of the wavelet coefficients of the low-resolution image to obtain the super-resolution (SR) result, a method we name Deep Wavelet Super-Resolution (DWSR). Our network is trained in the wavelet domain with four input and four output channels. The input comprises the four sub-bands of the low-resolution wavelet coefficients, and the outputs are the residuals (missing details) of the four sub-bands of the high-resolution wavelet coefficients. Using wavelet coefficients and wavelet residuals as the network's inputs and outputs further enhances the sparsity of the activation maps. A key benefit of this design is that it greatly reduces the burden of learning to reconstruct low-frequency details. The predicted output is added to the input to form the final SR wavelet coefficients, and the inverse 2D discrete wavelet transform is then applied to these coefficients to generate the SR result. We show that DWSR is computationally simpler and yet produces competitive and often better results than state-of-the-art alternatives.
{"title":"Deep Wavelet Prediction for Image Super-Resolution","authors":"Tiantong Guo, Hojjat Seyed Mousavi, T. Vu, V. Monga","doi":"10.1109/CVPRW.2017.148","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.148","url":null,"abstract":"Recent advances have seen a surge of deep learning approaches for image super-resolution. Invariably, a network, e.g. a deep convolutional neural network (CNN) or auto-encoder is trained to learn the relationship between low and high-resolution image patches. Recognizing that a wavelet transform provides a \"coarse\" as well as \"detail\" separation of image content, we design a deep CNN to predict the \"missing details\" of wavelet coefficients of the low-resolution images to obtain the Super-Resolution (SR) results, which we name Deep Wavelet Super-Resolution (DWSR). Out network is trained in the wavelet domain with four input and output channels respectively. The input comprises of 4 sub-bands of the low-resolution wavelet coefficients and outputs are residuals (missing details) of 4 sub-bands of high-resolution wavelet coefficients. Wavelet coefficients and wavelet residuals are used as input and outputs of our network to further enhance the sparsity of activation maps. A key benefit of such a design is that it greatly reduces the training burden of learning the network that reconstructs low frequency details. The output prediction is added to the input to form the final SR wavelet coefficients. Then the inverse 2d discrete wavelet transformation is applied to transform the predicted details and generate the SR results. We show that DWSR is computationally simpler and yet produces competitive and often better results than state-of-the-art alternatives.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"1 1","pages":"1100-1109"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83353703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose AcFR, an active face recognition system that employs a convolutional neural network and acts consistently with human behaviors in common face recognition scenarios. AcFR comprises two main components: a recognition module and a controller module. The recognition module uses a pre-trained VGG-Face net to extract facial image features, together with a nearest-neighbor identity recognition algorithm. Based on the results, the controller module can make three different decisions: greet a recognized individual, disregard an unknown individual, or acquire a different viewpoint from which to reassess the subject, all of which are natural reactions when people observe passers-by. Evaluated on the PIE dataset, our recognition module yields higher accuracy on images taken at angles closer to those of the images saved in memory. This view dependence of the accuracy also provides evidence for the proper design of the controller module.
{"title":"AcFR: Active Face Recognition Using Convolutional Neural Networks","authors":"Masaki Nakada, Han Wang, Demetri Terzopoulos","doi":"10.1109/CVPRW.2017.11","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.11","url":null,"abstract":"We propose AcFR, an active face recognition system that employs a convolutional neural network and acts consistently with human behaviors in common face recognition scenarios. AcFR comprises two main components—a recognition module and a controller module. The recognition module uses a pre-trained VGG-Face net to extract facial image features along with a nearest neighbor identity recognition algorithm. Based on the results, the controller module can make three different decisions—greet a recognized individual, disregard an unknown individual, or acquire a different viewpoint from which to reassess the subject, all of which are natural reactions when people observe passers-by. Evaluated on the PIE dataset, our recognition module yields higher accuracy on images under closer angles to those saved in memory. The accuracy is viewdependent and it also provides evidence for the proper design of the controller module.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"22 1","pages":"35-40"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88753667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Kollias, M. Nicolaou, I. Kotsia, Guoying Zhao, S. Zafeiriou
In this paper we utilize the first large-scale "in-the-wild" (Aff-Wild) database, which is annotated in terms of the valence-arousal dimensions, to train and test an end-to-end deep neural architecture for estimating continuous emotion dimensions from visual cues. The proposed architecture jointly trains convolutional (CNN) and recurrent (RNN) layers, thus exploiting the invariant properties of convolutional features while also modelling, through the recurrent layers, the temporal dynamics that arise in human behaviour. Various pre-trained networks are used as starting structures and are subsequently fine-tuned on the Aff-Wild database. The obtained results show promise for the use of deep architectures in the visual analysis of human behaviour in terms of continuous emotion dimensions and in the analysis of different types of affect.
{"title":"Recognition of Affect in the Wild Using Deep Neural Networks","authors":"D. Kollias, M. Nicolaou, I. Kotsia, Guoying Zhao, S. Zafeiriou","doi":"10.1109/CVPRW.2017.247","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.247","url":null,"abstract":"In this paper we utilize the first large-scale \"in-the-wild\" (Aff-Wild) database, which is annotated in terms of the valence-arousal dimensions, to train and test an end-to-end deep neural architecture for the estimation of continuous emotion dimensions based on visual cues. The proposed architecture is based on jointly training convolutional (CNN) and recurrent neural network (RNN) layers, thus exploiting both the invariant properties of convolutional features, while also modelling temporal dynamics that arise in human behaviour via the recurrent layers. Various pre-trained networks are used as starting structures which are subsequently appropriately fine-tuned to the Aff-Wild database. Obtained results show premise for the utilization of deep architectures for the visual analysis of human behaviour in terms of continuous emotion dimensions and analysis of different types of affect.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"84 9 1","pages":"1972-1979"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87669190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traffic light detection (TLD) is a vital part of both intelligent vehicles and driving assistance systems (DAS). Most TLD methods, however, are evaluated on small, private datasets, making it hard to determine the exact performance of a given method. In this paper we apply the state-of-the-art, real-time object detection system You Only Look Once (YOLO) to the public LISA Traffic Light dataset, available through the VIVA challenge, which contains a large number of annotated traffic lights captured in varying light and weather conditions. The YOLO object detector achieves an impressive AUC of 90.49% for daySequence1, an improvement of 50.32% over the latest ACF entry in the VIVA challenge. Using the exact same training configuration as the ACF detector, the YOLO detector reaches an AUC of 58.3%, an increase of 18.13%.
{"title":"Evaluating State-of-the-Art Object Detector on Challenging Traffic Light Data","authors":"M. B. Jensen, Kamal Nasrollahi, T. Moeslund","doi":"10.1109/CVPRW.2017.122","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.122","url":null,"abstract":"Traffic light detection (TLD) is a vital part of both intelligent vehicles and driving assistance systems (DAS). General for most TLDs is that they are evaluated on small and private datasets making it hard to determine the exact performance of a given method. In this paper we apply the state-of-the-art, real-time object detection system You Only Look Once, (YOLO) on the public LISA Traffic Light dataset available through the VIVA-challenge, which contain a high number of annotated traffic lights, captured in varying light and weather conditions.,,,,,,The YOLO object detector achieves an AUC of impressively 90.49% for daysequence1, which is an improvement of 50.32% compared to the latest ACF entry in the VIVAchallenge. Using the exact same training configuration as the ACF detector, the YOLO detector reaches an AUC of 58.3%, which is in an increase of 18.13%.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"43 1","pages":"882-888"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91077239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The explosion of video production calls for new browsing frameworks. This is especially true in sports, where TV companies have years of recorded match archives to exploit and fans are looking for replays, summaries, or collections of events. In this work, we design a new multi-resolution motion feature for video abstraction. The descriptor is based on optical flow singularities tracked along the video. We use these "singlets" to detect zooms, slow-motions, and salient moments in soccer games, and finally to produce an automatic summary of a game. We build a database for soccer video summarization composed of four HDTV soccer matches from the 2014 FIFA World Cup, annotated with goals, fouls, corners, and salient moments for summarization. We correctly detect 88.2% of salient moments on this database. To highlight the generalization of our approach, we test the system on the final of the 2015 handball world championship without any retraining, refinement, or adaptation.
{"title":"Singlets: Multi-resolution Motion Singularities for Soccer Video Abstraction","authors":"K. Blanc, D. Lingrand, F. Precioso","doi":"10.1109/CVPRW.2017.15","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.15","url":null,"abstract":"The burst of video production appeals for new browsing frameworks. Chiefly in sports, TV companies have years of recorded match archives to exploit and sports fans are looking for replay, summary or collection of events. In this work, we design a new multi-resolution motion feature for video abstraction. This descriptor is based on optical flow singularities tracked along the video. We use these singlets in order to detect zooms, slow-motions and salient moments in soccer games and finally to produce an automatic summarization of a game. We produce a database for soccer video summarization composed of 4 soccer matches from HDTV games for the FIFA world cup 2014 annotated with goals, fouls, corners and salient moments to make a summary. We correctly detect 88.2% of saliant moments using this database. To highlight the generalization of our approach, we test our system on the final game of the handball world championship 2015 without any retraining, refining or adaptation.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"74 1","pages":"66-75"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80772285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. S. Aydin, Abhinandan Dubey, Daniel Dovrat, A. Aharoni, Roy Shilkrot
We present a method for foreground segmentation of yeast cells in the presence of high noise induced by intentionally low illumination, where traditional approaches (e.g., threshold-based methods and specialized cell-segmentation methods) fail. To deal with these harsh conditions, we use a fully convolutional semantic segmentation network based on the SegNet architecture. Our model segments patches extracted from yeast live-cell experiments with an mIoU score of 0.71 on unseen patches drawn from independent experiments. Further, we show that simultaneous multi-modal observations of bio-fluorescent markers can yield better segmentation performance than the DIC channel alone.
{"title":"CNN Based Yeast Cell Segmentation in Multi-modal Fluorescent Microscopy Data","authors":"A. S. Aydin, Abhinandan Dubey, Daniel Dovrat, A. Aharoni, Roy Shilkrot","doi":"10.1109/CVPRW.2017.105","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.105","url":null,"abstract":"We present a method for foreground segmentation of yeast cells in the presence of high-noise induced by intentional low illumination, where traditional approaches (e.g., threshold-based methods, specialized cell-segmentation methods) fail. To deal with these harsh conditions, we use a fully-convolutional semantic segmentation network based on the SegNet architecture. Our model is capable of segmenting patches extracted from yeast live-cell experiments with a mIOU score of 0.71 on unseen patches drawn from independent experiments. Further, we show that simultaneous multi-modal observations of bio-fluorescent markers can result in better segmentation performance than the DIC channel alone.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"23 1","pages":"753-759"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74101419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human action recognition from skeletal data is a hot research topic, important in many open-domain applications of computer vision thanks to recently introduced 3D sensors. In the literature, naive methods simply transfer off-the-shelf techniques from video to the skeletal representation. However, the current state-of-the-art is contended between two different paradigms: kernel-based methods and feature learning with (recurrent) neural networks. Both approaches show strong performance, yet they exhibit heavy, but complementary, drawbacks. Motivated by this fact, our work aims to combine the best of the two paradigms by proposing an approach in which a shallow network is fed with a covariance representation. Our intuition is that, as long as the dynamics are effectively modeled, the classification network need be neither deep nor recurrent in order to score favorably. We validate this hypothesis in a broad experimental analysis over six publicly available datasets.
{"title":"When Kernel Methods Meet Feature Learning: Log-Covariance Network for Action Recognition From Skeletal Data","authors":"Jacopo Cavazza, Pietro Morerio, Vittorio Murino","doi":"10.1109/CVPRW.2017.165","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.165","url":null,"abstract":"Human action recognition from skeletal data is a hot research topic and important in many open domain applications of computer vision, thanks to recently introduced 3D sensors. In the literature, naive methods simply transfer off-the-shelf techniques from video to the skeletal representation. However, the current state-of-the-art is contended between to different paradigms: kernel-based methods and feature learning with (recurrent) neural networks. Both approaches show strong performances, yet they exhibit heavy, but complementary, drawbacks. Motivated by this fact, our work aims at combining together the best of the two paradigms, by proposing an approach where a shallow network is fed with a covariance representation. Our intuition is that, as long as the dynamics is effectively modeled, there is no need for the classification network to be deep nor recurrent in order to score favorably. We validate this hypothesis in a broad experimental analysis over 6 publicly available datasets.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"85 1","pages":"1251-1258"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80141975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We describe an end-to-end system for explainable automatic job candidate screening from video CVs. In this application, audio, face, and scene features are first computed from an input video CV using rich feature sets. These multiple modalities are fed into modality-specific regressors to predict apparent personality traits and a variable indicating whether the subject will be invited to the interview. The base learners are stacked into an ensemble of decision trees to produce the outputs of the quantitative stage, and a single decision tree, combined with a rule-based algorithm, produces interview-decision explanations based on the quantitative results. The proposed system ranks first in both the quantitative and qualitative stages of the CVPR 2017 ChaLearn Job Candidate Screening Coopetition.
{"title":"Multi-modal Score Fusion and Decision Trees for Explainable Automatic Job Candidate Screening from Video CVs","authors":"Heysem Kaya, Furkan Gürpinar, A. A. Salah","doi":"10.1109/CVPRW.2017.210","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.210","url":null,"abstract":"We describe an end-to-end system for explainable automatic job candidate screening from video CVs. In this application, audio, face and scene features are first computed from an input video CV, using rich feature sets. These multiple modalities are fed into modality-specific regressors to predict apparent personality traits and a variable that predicts whether the subject will be invited to the interview. The base learners are stacked to an ensemble of decision trees to produce the outputs of the quantitative stage, and a single decision tree, combined with a rule-based algorithm produces interview decision explanations based on the quantitative results. The proposed system in this work ranks first in both quantitative and qualitative stages of the CVPR 2017 ChaLearn Job Candidate Screening Coopetition.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"28 1","pages":"1651-1659"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91061670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a principled approach for learning causal conditions from actions and activities taking place in the physical environment, through visual input. Causal conditions are the preconditions that must exist before a certain effect can ensue. We propose to consider diachronic and synchronic causal conditions separately for the learning of causal knowledge. The diachronic condition captures the "change" aspect of the causal relationship (what change must be present at a certain time to effect a subsequent change), while the synchronic condition is the "contextual" aspect (what "static" condition must be present to enable the causal relationship involved). This paper focuses on the learning of synchronic causal conditions and proposes a principled framework for the learning of causal knowledge, including the learning of extended cause-effect sequences and the encoding of this knowledge in the form of scripts for prediction and problem solving.
{"title":"The Role of Synchronic Causal Conditions in Visual Knowledge Learning","authors":"Seng-Beng Ho","doi":"10.1109/CVPRW.2017.8","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.8","url":null,"abstract":"We propose a principled approach for the learning of causal conditions from actions and activities taking place in the physical environment through visual input. Causal conditions are the preconditions that must exist before a certain effect can ensue. We propose to consider diachronic and synchronic causal conditions separately for the learning of causal knowledge. Diachronic condition captures the \"change\" aspect of the causal relationship – what change must be present at a certain time to effect a subsequent change – while the synchronic condition is the \"contextual\" aspect – what \"static\" condition must be present to enable the causal relationship involved. This paper focuses on discussing the learning of synchronic causal conditions as well as proposing a principled framework for the learning of causal knowledge including the learning of extended sequences of cause-effect and the encoding of this knowledge in the form of scripts for prediction and problem solving.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"221 1","pages":"9-16"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83476246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent neural networks (RNNs) are able to capture context in an image by modeling long-range semantic dependencies among image units. However, existing methods only utilize RNNs to model dependencies of a single modality (e.g., RGB) for labeling. In this work we extend single-modal RNNs to multimodal RNNs (MM-RNNs) and apply them to RGB-D scene labeling. Our MM-RNNs seamlessly model dependencies of both the RGB and depth modalities and allow 'memory' sharing across modalities. By sharing 'memory', each modality possesses multiple properties of itself and of the other modality, and becomes more discriminative in distinguishing pixels. Moreover, we analyse two simple extensions of single-modal RNNs and demonstrate that our MM-RNNs outperform both of them. Integrating with convolutional neural networks (CNNs), we build an end-to-end network for RGB-D scene labeling. Extensive experiments on NYU Depth V1 and V2 demonstrate the effectiveness of MM-RNNs.
{"title":"RGB-D Scene Labeling with Multimodal Recurrent Neural Networks","authors":"Heng Fan, Xue Mei, D. Prokhorov, Haibin Ling","doi":"10.1109/CVPRW.2017.31","DOIUrl":"https://doi.org/10.1109/CVPRW.2017.31","url":null,"abstract":"Recurrent neural networks (RNNs) are able to capture context in an image by modeling long-range semantic dependencies among image units. However, existing methods only utilize RNNs to model dependencies of a single modality (e.g., RGB) for labeling. In this work we extend this single-modal RNNs to multimodal RNNs (MM-RNNs) and apply it to RGB-D scene labeling. Our MM-RNNs are capable of seamlessly modeling dependencies of both RGB and depth modalities, and allow 'memory' sharing across modalities. By sharing 'memory', each modality possesses multiple properties of itself and other modalities, and becomes more discriminative to distinguish pixels. Moreover, we also analyse two simple extensions of single-modal RNNs and demonstrate that our MM-RNNs perform better than both of them. Integrating with convolutional neural networks (CNNs), we build an end-to-end network for RGB-D scene labeling. Extensive experiments on NYU depth V1 and V2 demonstrate the effectiveness of MM-RNNs.","PeriodicalId":6668,"journal":{"name":"2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"41 1","pages":"203-211"},"PeriodicalIF":0.0,"publicationDate":"2017-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85214678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}