Learning to encode user-generated short videos with lower bitrate and the same perceptual quality
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301835
Shengbin Meng, Yang Li, Yiting Liao, Junlin Li, Shiqi Wang
On a user-generated content (UGC) platform, uploaded videos need to be re-encoded before distribution. For this specific encoding scenario, we propose a novel dataset and a corresponding learning-based scheme that achieves significant bitrate savings without decreasing perceptual quality. In the dataset, each video's label indicates whether it can be encoded at a much lower bitrate while still keeping the same perceptual quality. Models trained on this dataset can then be used to classify the input video and adjust its final encoding parameters accordingly. With sufficient classification accuracy, more than 20% average bitrate saving can be obtained through the proposed scheme. The dataset will be further expanded to facilitate the study of this problem.
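To make the scheme's decision step concrete, the following minimal Python sketch shows how a trained classifier's label could steer the final encoding bitrate. The function and parameter names, the scikit-learn-style `predict()` interface, and the 25% reduction are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the classify-then-encode decision described above.
# `classifier` is assumed to expose a scikit-learn-style predict() method and
# to output 1 when the video keeps the same perceptual quality at a lower rate.

def choose_target_bitrate(features, classifier, default_kbps: int,
                          reduction: float = 0.25) -> int:
    """Return the target bitrate (kbps) for the final encode of one UGC video."""
    can_lower = classifier.predict([features])[0] == 1
    if can_lower:
        return int(default_kbps * (1.0 - reduction))  # spend ~25% fewer bits
    return default_kbps
```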
{"title":"Learning to encode user-generated short videos with lower bitrate and the same perceptual quality","authors":"Shengbin Meng, Yang Li, Yiting Liao, Junlin Li, Shiqi Wang","doi":"10.1109/VCIP49819.2020.9301835","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301835","url":null,"abstract":"On a platform of user-generated content (UGC), the uploaded videos need to be encoded again before distribution. For this specific encoding scenario, we propose a novel dataset and a corresponding learning-based scheme that is able to achieve significant bitrate saving without decreasing perceptual quality. In the dataset, each video’s label indicates whether it can be encoded with a much lower bitrate while still keeps the same perceptual quality. Models trained on this dataset can then be used to classify the input video and adjust its final encoding parameters accordingly. With enough classification accuracy, more than 20% average bitrate saving can be obtained through the proposed scheme. The dataset will be further expanded to facilitate the study on this problem.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129182762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic Sheep Counting by Multi-object Tracking
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301868
Jingsong Xu, Litao Yu, Jian Zhang, Qiang Wu
Animal counting is a highly skilled yet tedious task in livestock transportation and trading. To effectively free up human labour and provide accurate counts for sheep loading/unloading, we develop an automatic sheep counting system based on multi-object detection, tracking and extrapolation techniques. Our system has demonstrated more than 99.9% accuracy with sheep moving freely in a race under optimal visual conditions.
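A common way to turn per-frame detections and tracks into a count is to count each track exactly once as it crosses a virtual line across the race; the sketch below illustrates that idea only and is not the authors' implementation (the detection and tracking back-ends are assumed to exist elsewhere).

```python
# Illustrative counting step: each tracked sheep ID is counted exactly once
# when its horizontal position crosses a virtual line across the race.

def update_count(tracks, line_x: float, counted: set) -> int:
    """tracks: iterable of (track_id, prev_x, curr_x) for the current frame.
    Returns the number of newly counted sheep and records their IDs in `counted`."""
    newly_counted = 0
    for track_id, prev_x, curr_x in tracks:
        if track_id not in counted and prev_x < line_x <= curr_x:
            counted.add(track_id)
            newly_counted += 1
    return newly_counted
```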
{"title":"Automatic Sheep Counting by Multi-object Tracking","authors":"Jingsong Xu, Litao Yu, Jian Zhang, Qiang Wu","doi":"10.1109/VCIP49819.2020.9301868","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301868","url":null,"abstract":"Animal counting is a highly skilled yet tedious task in livestock transportation and trading. To effectively free up the human labour and provide accurate counts for sheep loading/unloading, we develop an auto sheep counting system based on multi-object detection, tracking and extrapolation techniques. Our system has demonstrated more than 99.9% accuracy with sheep moving freely in a race under optimal visual conditions.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120994298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Content-aware Hybrid Equi-angular Cubemap Projection for Omnidirectional Video Coding
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301893
Jinyong Pi, Yun Zhang, Linwei Zhu, Xinju Wu, Xuemei Zhou
Due to its spherical characteristics, omnidirectional video must be projected from the three-dimensional (3D) sphere to a two-dimensional (2D) plane before compression. Therefore, various projection formats have been proposed in recent years. However, these existing projection methods suffer from either oversampling or boundary discontinuity, which penalizes coding performance. Among them, Hybrid Equiangular Cubemap (HEC) projection has achieved significant coding gains by keeping boundary continuity compared with Equi-Angular Cubemap (EAC) projection. However, the parameters of its mapping function are fixed and cannot adapt to the video content, which results in non-uniform sampling in certain regions. To address this limitation, a projection method named Content-aware HEC (CHEC) is presented in this paper. In particular, the parameters of the mapping function are derived adaptively by minimizing the projection conversion distortion. Additionally, an omnidirectional video coding framework with adaptive mapping-function parameters is proposed to effectively improve coding performance. Experimental results show that the proposed scheme achieves 8.57% and 0.11% bit rate reduction on average in terms of End-to-End Weighted-to-Spherically-uniform Peak Signal-to-Noise Ratio (E2E WS-PSNR) compared with Equi-Rectangular Projection (ERP) and HEC projections, respectively.
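The content-adaptive idea can be pictured as a search over mapping-function parameters that minimizes conversion distortion for the current content. The sketch below is a generic illustration under that assumption; the HEC mapping form, parameter grid and distortion measure used in the paper are not reproduced here.

```python
# Generic parameter selection by minimizing round-trip conversion distortion.
# `forward` and `inverse` stand in for a parameterized projection and its
# inverse; both are caller-supplied placeholders, not the paper's definitions.
import numpy as np

def conversion_distortion(frame, param, forward, inverse) -> float:
    """Mean squared error after projecting and re-projecting one frame."""
    reconstructed = inverse(forward(frame, param), param)
    return float(np.mean((frame - reconstructed) ** 2))

def select_mapping_param(frame, forward, inverse, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the candidate parameter with the lowest conversion distortion."""
    return min(grid, key=lambda p: conversion_distortion(frame, p, forward, inverse))
```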
{"title":"Content-aware Hybrid Equi-angular Cubemap Projection for Omnidirectional Video Coding","authors":"Jinyong Pi, Yun Zhang, Linwei Zhu, Xinju Wu, Xuemei Zhou","doi":"10.1109/VCIP49819.2020.9301893","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301893","url":null,"abstract":"Omnidirectional video is required to be projected from the Three-Dimensional (3D) sphere to a Two-Dimensional (2D) plane before compression due to its spherical characteristics. Therefore, various projection formats have been proposed in recent years. However, these existing projection methods have problems of either oversampling or discontinuous boundary, which penalize the coding performance. Among them, Hybrid Equiangular Cubemap (HEC) projection has achieved significant coding gains by keeping boundary continuity when compared with Equi-Angular Cubemap (EAC) projection. However, the parameters of its mapping function are fixed and cannot adapt to the video contents, which results in non-uniform sampling in certain regions. To address this limitation, a projection method named Content-aware HEC (CHEC) is presented in this paper. In particular, these parameters of mapping function are adaptively achieved by minimizing the projection conversion distortion. Additionally, an omnidirectional video coding framework with adaptive parameters of mapping function is proposed to effectively improve the coding performance. Experimental results show that the proposed scheme achieves 8.57% and 0.11% bit rate reduction on average in terms of End-to-End Weighted to Spherically uniform Peak Signal to Noise Ratio (E2E WS-PSNR) when compared with Equi-Rectangular Projection (ERP) and HEC projections, respectively.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122842491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HDR Deghosting Using Motion-Registration-Free Fusion in the Luminance Gradient Domain
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301844
Cheng-Yeh Liou, Cheng-Yen Chuang, Chia-Han Huang, Yi-Chang Lu
Most existing high dynamic range (HDR) deghosting flows require a time-consuming motion registration step to generate ghost-free HDR results. Since the motion registration step usually becomes the bottleneck of the entire flow, in this paper we propose a novel HDR deghosting flow that does not require any motion registration process. By taking channel properties into account, the luminance and chrominance channels are fused differently in the proposed flow. Our motion-registration-free fusion can generate high-quality HDR results swiftly even if the original low dynamic range (LDR) images contain objects with large foreground motions.
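For orientation, the sketch below shows a generic per-pixel exposure fusion of the luminance channel with well-exposedness weights and no motion registration. It only illustrates the flavour of registration-free fusion; the paper's method works in the luminance gradient domain and treats chrominance separately.

```python
# Minimal registration-free exposure fusion of luminance (well-exposedness
# weights only); an illustrative stand-in, not the paper's gradient-domain method.
import numpy as np

def fuse_luminance(lum_stack: np.ndarray, sigma: float = 0.2) -> np.ndarray:
    """lum_stack: (N, H, W) luminance images normalized to [0, 1]."""
    weights = np.exp(-((lum_stack - 0.5) ** 2) / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=0, keepdims=True) + 1e-8  # normalize across exposures
    return (weights * lum_stack).sum(axis=0)
```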
{"title":"HDR Deghosting Using Motion-Registration-Free Fusion in the Luminance Gradient Domain","authors":"Cheng-Yeh Liou, Cheng-Yen Chuang, Chia-Han Huang, Yi-Chang Lu","doi":"10.1109/VCIP49819.2020.9301844","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301844","url":null,"abstract":"For most of the existing high dynamic range (HDR) deghosting flows, they require a time-consuming motion registration step to generate ghost-free HDR results. Since the motion registration step usually becomes the bottleneck of the entire flow, in this paper, we propose a novel H DR deghosting flow which does not require any motion registration process. By taking channel properties into account, the luminance and chrominance channels are fused differently in the proposed flow. Our motion-registration-free fusion could generate high-quality HDR results swiftly even if the original Low Dynamic Range (LDR) images contain objects with large foreground motions.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122950048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-to-Image Generation via Semi-Supervised Training
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301888
Zhongyi Ji, Wenmin Wang, Baoyang Chen, Xiao Han
Synthesizing images from text is an important problem with various applications. Most existing studies of text-to-image generation utilize supervised methods and rely on a fully-labeled dataset, but detailed and accurate descriptions of images are onerous to obtain. In this paper, we introduce a simple but effective semi-supervised approach that treats the features of unlabeled images as "Pseudo Text Features", so that the unlabeled data can participate in the subsequent training process. To achieve this, we design a Modality-invariant Semantic-consistent Module which aims to make the image feature and the text feature indistinguishable while maintaining their semantic information. Extensive qualitative and quantitative experiments on the MNIST and Oxford-102 flower datasets demonstrate the effectiveness of our semi-supervised method in comparison to supervised ones. We also show that the proposed method can be easily plugged into other visual generation models such as image translation and performs well.
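The core "Pseudo Text Feature" idea reduces to a simple substitution at training time, sketched below under the assumption that labeled samples carry a real text feature and unlabeled ones do not; the module that makes the two feature spaces indistinguishable is not shown.

```python
# Hedged sketch: when a sample has no caption, its image feature stands in as a
# "pseudo text feature" so the sample can still condition the generator.

def conditioning_feature(image_feature, text_feature=None):
    """Return the feature used to condition the generator for one sample."""
    if text_feature is not None:
        return text_feature   # labeled sample: use the real text feature
    return image_feature      # unlabeled sample: pseudo text feature
```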
{"title":"Text-to-Image Generation via Semi-Supervised Training","authors":"Zhongyi Ji, Wenmin Wang, Baoyang Chen, Xiao Han","doi":"10.1109/VCIP49819.2020.9301888","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301888","url":null,"abstract":"Synthesizing images from text is an important problem and has various applications. Most of the existing studies of text-to-image generation utilize supervised methods and rely on a fully-labeled dataset, but detailed and accurate descriptions of images are onerous to obtain. In this paper, we introduce a simple but effective semi-supervised approach that considers the feature of unlabeled images as \"Pseudo Text Feature\". Therefore, the unlabeled data can participate in the following training process. To achieve this, we design a Modality-invariant Semantic- consistent Module which aims to make the image feature and the text feature indistinguishable and maintain their semantic information. Extensive qualitative and quantitative experiments on MNIST and Oxford-102 flower datasets demonstrate the effectiveness of our semi-supervised method in comparison to supervised ones. We also show that the proposed method can be easily plugged into other visual generation models such as image translation and performs well.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115469204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Empirical Study of Emotion Recognition from Thermal Video Based on Deep Neural Networks
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301883
Herman Prawiro, Tse-Yu Pan, Min-Chun Hu
Emotion recognition is a crucial problem in affective computing. Most previous works utilized facial expressions from visible-spectrum data to solve the emotion recognition task. Thermal videos provide temperature measurements of the human body over time, which can be used to recognize affective states by learning their temporal patterns. In this paper, we conduct comparative experiments to study the effectiveness of existing deep neural networks when applied to the emotion recognition task on thermal video. We analyze the effect of various approaches to frame sampling in video, temporal aggregation between frames, and different convolutional neural network architectures. To the best of our knowledge, we are the first work to conduct a study on emotion recognition from thermal video based on deep neural networks. Our work can serve as a preliminary study for designing new methods for emotion recognition in the thermal domain.
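Two of the design axes the study compares, frame sampling and temporal aggregation, can be illustrated as below; uniform sampling and mean pooling are just one assumed pair of choices among those such a study might evaluate.

```python
# Illustrative frame sampling and temporal aggregation helpers (assumed choices).
import numpy as np

def sample_frame_indices(num_frames: int, num_samples: int) -> np.ndarray:
    """Indices of `num_samples` frames spread uniformly over the clip."""
    return np.linspace(0, num_frames - 1, num_samples).astype(int)

def aggregate_over_time(per_frame_features) -> np.ndarray:
    """Mean-pool per-frame features of shape (T, D) into a clip-level feature."""
    return np.asarray(per_frame_features).mean(axis=0)
```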
{"title":"An Empirical Study of Emotion Recognition from Thermal Video Based on Deep Neural Networks","authors":"Herman Prawiro, Tse-Yu Pan, Min-Chun Hu","doi":"10.1109/VCIP49819.2020.9301883","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301883","url":null,"abstract":"Emotion recognition is a crucial problem in affective computing. Most of previous works utilized facial expression from visible spectrum data to solve emotion recognition task. Thermal videos provide temperature measurement of human body over time, which can be used to recognize affective states by learning its temporal pattern. In this paper, we conduct comparative experiments to study the effectiveness of the existing deep neural networks when applied to emotion recognition task from thermal video. We analyze the effect of various approaches for frame sampling in video, temporal aggregation between frames, and different convolutional neural network architectures. To the best of our knowledge, we are the first w ork t o c onduct s tudy on emotion recognition from thermal video based on deep neural networks. Our work can provide preliminary study to design new methods for emotion recognition in thermal domain.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132319944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D-CNN Autoencoder for Plenoptic Image Compression
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301793
Tingting Zhong, Xin Jin, Kedeng Tong
Recently, plenoptic images have attracted great attention because of their applications in various scenarios. However, their high resolution and special pixel distribution structure bring huge challenges to storage and transmission. In order to adapt compression to the structural characteristics of plenoptic images, in this paper we propose a Data Structure Adaptive 3D-convolutional (DSA-3D) autoencoder. The DSA-3D autoencoder up-samples and down-samples the sub-aperture sequence along the angular or spatial resolution, thereby avoiding the artifacts caused by directly compressing the plenoptic image and achieving better compression efficiency. In addition, we propose a special and efficient Square rearrangement to generate the sub-aperture sequence. We compare the Square and Zigzag sub-aperture sequence rearrangements, and analyze the compression efficiency of block image compression versus whole image compression. Compared with the traditional hybrid encoders HEVC, JPEG2000 and JPEG Pleno (WaSP), the proposed DSA-3D (Square) autoencoder achieves superior performance in terms of PSNR.
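As a rough picture of the model class, the following toy 3D-convolutional autoencoder treats the sub-aperture sequence as the depth axis of the input tensor. Layer counts and channel widths are arbitrary placeholders, not the DSA-3D design.

```python
# Toy 3D-conv autoencoder over a sub-aperture sequence shaped (N, C, T, H, W);
# the architecture is illustrative only, not the paper's DSA-3D autoencoder.
import torch
import torch.nn as nn

class Tiny3DAutoencoder(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(16, channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # e.g. an input of shape (1, 3, 8, 64, 64) is reconstructed at the same shape
        return self.decoder(self.encoder(x))
```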
{"title":"3D-CNN Autoencoder for Plenoptic Image Compression","authors":"Tingting Zhong, Xin Jin, Kedeng Tong","doi":"10.1109/VCIP49819.2020.9301793","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301793","url":null,"abstract":"Recently, plenoptic image has attracted great attentions because of its applications in various scenarios. However, high resolution and special pixel distribution structure bring huge challenges to its storage and transmission. In order to adapt compression to the structural characteristic of plenoptic image, in this paper, we propose a Data Structure Adaptive 3D-convolutional(DSA-3D) autoencoder. The DSA-3D autoencoder enables up-sampling and down-samping the sub-aperture sequence along the angular resolution or spatial resolution, thereby avoiding the artifacts caused by directly compressing plenoptic image and achieving better compression efficiency. In addition, we propose a special and efficient Square rearrangement to generate sub-aperture sequence. We compare Square with Zigzag sub-aperture sequence rearrangements, and analyzed the compression efficiency of block image compression and whole image compression. Compared with traditional hybrid encoders HEVC, JPEG2000 and JPEG PLENO(WaSP), the proposed DSA-3D(Square) autoencoder achieves a superior performance in terms of PSNR metrics.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126754714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Theory of Occlusion for Improving Rendering Quality of Views
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301887
Yijun Zeng, Weiyan Chen, Mengqin Bai, Yangdong Zeng, Changjian Zhu
Occlusion lack compensation (OLC) is a data acquisition and novel-view rendering strategy for light field rendering (LFR) that optimizes multiplexing gain. While the achievable OLC is much higher than previously thought possible, the improvement comes at the cost of requiring more scene information. More detailed scene information, including geometric, texture and depth information, can be captured by learning and training methods. In this paper, we develop an occlusion compensation (OCC) model based on the restricted Boltzmann machine (RBM) to compensate for the scene information missing due to occlusion. We show that occlusion causes a lack of captured scene information, which leads to a decline in view rendering quality. The OCC model can estimate and compensate for the missing information at occlusion edges through learning. We present experimental results to demonstrate the performance of the OCC model with analog training, verify our theoretical analysis, and extend our conclusions on the optimal rendering quality of light fields.
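One way to picture RBM-based compensation is as conditional Gibbs sampling: observed pixels are clamped while missing ones are resampled. The sketch below uses scikit-learn's BernoulliRBM purely as an illustration and assumes an RBM already fitted on binarized patches; it is not the paper's OCC model.

```python
# Fill occluded pixels of a binary patch by Gibbs sampling with a fitted
# BernoulliRBM, clamping the observed pixels after every step (illustrative only).
import numpy as np
from sklearn.neural_network import BernoulliRBM

def compensate_patch(rbm: BernoulliRBM, patch: np.ndarray, occluded: np.ndarray,
                     steps: int = 50, seed: int = 0) -> np.ndarray:
    """patch: flattened binary patch; occluded: boolean mask of missing pixels."""
    rng = np.random.default_rng(seed)
    v = patch.astype(float).copy()
    v[occluded] = rng.random(int(occluded.sum()))              # random init for missing pixels
    for _ in range(steps):
        v = rbm.gibbs(v.reshape(1, -1)).ravel().astype(float)  # one Gibbs step
        v[~occluded] = patch[~occluded]                        # clamp observed pixels
    return v
```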
{"title":"A Theory of Occlusion for Improving Rendering Quality of Views","authors":"Yijun Zeng, Weiyan Chen, Mengqin Bai, Yangdong Zeng, Changjian Zhu","doi":"10.1109/VCIP49819.2020.9301887","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301887","url":null,"abstract":"Occlusion lack compensation (OLC) is a multiplexing gain optimization data acquisition and novel views rendering strategy for light field rendering (LFR). While the achieved OLC is much higher than previously thought possible, the improvement comes at the cost of requiring more scene information. This can capture more detailed scene information, including geometric information, texture information and depth information, by learning and training methods. In this paper, we develop an occlusion compensation (OCC) model based on restricted boltzmann machine (RBM) to compensate for lack scene information caused by occlusion. We show that occlusion will cause the lack of captured scene information, which will lead to the decline of view rendering quality. The OCC model can estimate and compensate the lack information of occlusion edge by learning. We present experimental results to demonstrate the performance of OCC model with analog training, verify our theoretical analysis, and extend our conclusions on optimal rendering quality of light field.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114179823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CSCNet: A Shallow Single Column Network for Crowd Counting
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301855
Zhida Zhou, Li Su, Guorong Li, Yifan Yang, Qingming Huang
Crowd counting in complex scenes is an important but challenging task. The scale variation of crowds makes it hard for shallow networks to extract effective features. In this paper, we propose a shallow single-column network named CSCNet for crowd counting. The key component is the complementary scale context block (CSCB), which is designed to capture complementary scale context and obtain high accuracy with limited network depth. As far as we know, CSCNet is the shallowest single-column network among existing works. We demonstrate our method on three challenging benchmarks. Compared to state-of-the-art methods, CSCNet achieves comparable accuracy with much less complexity, providing an alternative that reaches comparable or even better performance with the depth and width reduced by about 30% and 50%, respectively. Besides, CSCNet performs more stably on both sparse and congested crowd scenes.
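The general flavour of a complementary scale context block, parallel branches with different receptive fields whose outputs are concatenated inside a single column, can be sketched as follows; the actual CSCB/CSCNet design is not reproduced here.

```python
# Illustrative single-column block: parallel dilated 3x3 branches concatenated
# to cover complementary scales (not the paper's exact CSCB).
import torch
import torch.nn as nn

class MultiScaleContextBlock(nn.Module):
    def __init__(self, in_channels: int, branch_channels: int = 16,
                 dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # each dilated branch preserves spatial size; concatenate along channels
        return torch.cat([torch.relu(branch(x)) for branch in self.branches], dim=1)
```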
{"title":"CSCNet: A Shallow Single Column Network for Crowd Counting","authors":"Zhida Zhou, Li Su, Guorong Li, Yifan Yang, Qingming Huang","doi":"10.1109/VCIP49819.2020.9301855","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301855","url":null,"abstract":"Crowd counting in complex scene is an important but challenge task. The scale variation of crowd makes the shallow network hard to extract effective features. In this paper, we propose a shallow single column network named CSCNet for crowd counting. The key component is complementary scale context block (CSCB). It is designed to capture complementary scale context and obtains a high accuracy with limited depth of the network. As far as we know, CSCNet is the shallowest single column network in existing works. We demonstrate our methods on three challenge benchmarks. Compared to state-of-the-art methods, CSCNet achieves comparable accuracy with much less complexity. CSCNet provides an alternative to achieve comparable or even better performance with about 30% of depth and 50% of width decrease. Besides, CSCNet performs more stably on both sparse and congested crowd scenes.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115218621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geometric-visual descriptor for improved image based localization
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301831
Achref Ouni, E. Royer, Marc Chevaldonné, M. Dhome
This paper addresses the problem of image-based localization. The goal is to find, quickly and accurately, the relative pose between a query taken from a stereo camera and a map obtained using visual SLAM, which contains poses and 3D points associated with descriptors. In this paper we introduce a new method that leverages stereo vision by adding geometric information to visual descriptors. The method can be used when the vertical direction of the camera is known (for example, on a wheeled robot). This new geometric-visual descriptor can be used with several image-based localization algorithms based on visual words. We test the approach on different datasets (indoor, outdoor) and show experimentally that the new geometric-visual descriptor improves standard image-based localization approaches.
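When the vertical direction is known, one simple geometric cue that can be appended to a visual descriptor is the 3D point's height along the gravity axis; the sketch below illustrates that kind of augmentation, with the weighting and the exact cue used in the paper left as assumptions.

```python
# Append a gravity-aligned geometric cue (point height) to a visual descriptor;
# the cue and its weighting are illustrative assumptions, not the paper's definition.
import numpy as np

def geometric_visual_descriptor(visual_desc, point_3d, vertical_axis, weight: float = 1.0):
    """Concatenate the descriptor with the point's coordinate along the vertical axis."""
    axis = np.asarray(vertical_axis, dtype=float)
    axis /= np.linalg.norm(axis)
    height = float(np.dot(np.asarray(point_3d, dtype=float), axis))
    return np.concatenate([np.asarray(visual_desc, dtype=float), [weight * height]])
```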
{"title":"Geometric-visual descriptor for improved image based localization","authors":"Achref Ouni, E. Royer, Marc Chevaldonné, M. Dhome","doi":"10.1109/VCIP49819.2020.9301831","DOIUrl":"https://doi.org/10.1109/VCIP49819.2020.9301831","url":null,"abstract":"This paper addresses the problem of image based localization. The goal is to find quickly and accurately the relative pose from a query taken from a stereo camera and a map obtained using visual SLAM which contains poses and 3D points associated to descriptors. In this paper we introduce a new method that leverages the stereo vision by adding geometric information to visual descriptors. This method can be used when the vertical direction of the camera is known (for example on a wheeled robot). This new geometric visual descriptor can be used with several image based localization algorithms based on visual words. We test the approach with different datasets (indoor, outdoor) and we show experimentally that the new geometric-visual descriptor improves standard image based localization approaches.","PeriodicalId":431880,"journal":{"name":"2020 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115349441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}