Geodesic Disparity Compensation for Inter-View Prediction in VR180
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301750
K. Sivakumar, B. Vishwanath, K. Rose
The VR180 format is gaining considerable traction among the various promising immersive multimedia formats that will arguably dominate future multimedia consumption applications. VR180 enables stereo viewing of a hemisphere about the user. The increased field of view and the stereo setting result in extensive volumes of data that strongly motivate the pursuit of novel efficient compression tools tailored to this format. This paper’s focus is on the critical inter-view prediction module that exploits correlations between camera views. Existing approaches mainly consist of projection to a plane where traditional multi-view coders are applied, and disparity compensation employs simple block translation in the plane. However, warping due to the projection renders such compensation highly suboptimal. The proposed approach circumvents this shortcoming by performing geodesic disparity compensation on the sphere. It leverages the observation that, as an observer moves from one viewpoint to the other, all points on surrounding objects are perceived to move along respective geodesics on the sphere, which all intersect at the two points where the axis connecting the two viewpoints pierces the sphere. Thus, the proposed method performs inter-view prediction on the sphere by moving pixels along their predefined respective geodesics, and accurately captures the perceived deformations. Experimental results show significant bitrate savings and evidence the efficacy of the proposed approach.
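For intuition, the geodesic motion described above can be sketched as a rotation of a unit-sphere point within the plane spanned by that point and the epipole (the point where the axis connecting the two viewpoints pierces the sphere). The snippet below is a minimal illustration of that geometry, not the authors' codec implementation; the function name geodesic_shift and the angular disparity delta are assumptions made for the example.

```python
import numpy as np

def geodesic_shift(p, epipole, delta):
    """Move a unit vector p along the great circle (geodesic) through p and
    the epipole by an angular disparity delta (radians). The geodesic lies in
    the plane spanned by p and the epipole, so the rotation axis is their
    normalized cross product; Rodrigues' formula rotates p within that plane."""
    p = p / np.linalg.norm(p)
    e = epipole / np.linalg.norm(epipole)
    axis = np.cross(e, p)
    norm = np.linalg.norm(axis)
    if norm < 1e-9:          # p coincides with an epipole: no perceived motion
        return p
    k = axis / norm
    return (p * np.cos(delta)
            + np.cross(k, p) * np.sin(delta)
            + k * np.dot(k, p) * (1.0 - np.cos(delta)))

# Example: shift a point by an angular disparity of 0.5 degrees.
point = np.array([0.0, 0.0, 1.0])
epipole = np.array([1.0, 0.0, 0.0])
print(geodesic_shift(point, epipole, np.deg2rad(0.5)))
```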
Fast Intra Coding Algorithm for Depth Map with End-to-End Edge Detection Network
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301859
Chang Liu, Ke-bin Jia, Pengyu Liu
Compared with traditional High Efficiency Video Coding (HEVC), 3D-HEVC introduces multi-view coding and depth map coding, which leads to a significant increase in coding complexity. In this paper, we propose a low-complexity intra coding algorithm for depth maps based on an end-to-end edge detection network. Firstly, we use the Holistically Nested Edge Detection (HED) network to determine the edge locations of the depth map. Secondly, we use the Otsu method to divide the output of the HED network into foreground and background regions. Finally, the CU size and the candidate list of intra modes are determined according to the region of the coding tree unit (CTU). Experimental results demonstrate that the proposed algorithm reduces the encoding time by 39.56% on average with negligible degradation of coding performance.
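A minimal sketch of the region-splitting step, assuming an HED edge-probability map at the depth map's resolution and a 64x64 CTU grid; OpenCV's Otsu thresholding stands in for the Otsu step, and the paper's subsequent CU-size and intra-mode shortlisting rules are not reproduced.

```python
import cv2
import numpy as np

def split_ctus_by_edges(hed_prob_map, ctu_size=64):
    """Binarize an HED edge-probability map (floats in [0, 1]) with Otsu's
    method and flag, for each CTU, whether it contains edge pixels.
    Returns a boolean grid with one entry per CTU (True = edge region)."""
    edges_u8 = np.uint8(np.clip(hed_prob_map, 0.0, 1.0) * 255)
    _, binary = cv2.threshold(edges_u8, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    rows, cols = binary.shape[0] // ctu_size, binary.shape[1] // ctu_size
    ctu_has_edge = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            block = binary[r * ctu_size:(r + 1) * ctu_size,
                           c * ctu_size:(c + 1) * ctu_size]
            ctu_has_edge[r, c] = bool(block.any())
    return ctu_has_edge
```

An encoder could then, for example, keep the full CU sizes and intra-mode list for edge CTUs while letting smooth CTUs terminate CU splitting early with a reduced candidate list.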
Recent Advances in End-to-End Learned Image and Video Compression
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301753
Wen-Hsiao Peng, H. Hang
The DCT-based transform coding technique has been adopted by international standards (ISO JPEG, ITU H.261/264/265, ISO MPEG-2/4/H, and many others) for nearly 30 years. Although researchers are still trying to improve its efficiency by fine-tuning its components and parameters, the basic structure has not changed in the past two decades. The deep learning technology developed recently may provide a new direction for constructing a high-compression image/video coding system. Recent results, particularly from the Challenge on Learned Image Compression (CLIC) at CVPR, indicate that this new type of scheme (often trained end-to-end) may have good potential for further improving compression efficiency. In the first part of this tutorial, we shall (1) briefly summarize the progress on this topic over the past three or so years, including an overview of the CLIC results and the JPEG AI Call-for-Evidence Challenge on Learning-based Image Coding (issued in early 2020). Because Deep Neural Network (DNN)-based image compression is a new area, several techniques and structures have been tested. Recently published autoencoder-based schemes can achieve PSNR similar to BPG (Better Portable Graphics, the H.265 still-image standard) and superior subjective quality (e.g., MS-SSIM), especially at very low bit rates. In the second part, we shall (2) address the detailed design concepts of image compression algorithms using the autoencoder structure. In the third part, we shall switch gears to (3) explore the emerging area of DNN-based video compression. Recent publications in this area have indicated that end-to-end trained video compression can achieve rate-distortion performance comparable or superior to HEVC/H.265. CLIC at CVPR 2020 also created, for the first time, a new track dedicated to P-frame coding.
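As a concrete reference for the autoencoder structure discussed in the second part, the following is a minimal PyTorch sketch of an analysis/synthesis transform pair with a training-time quantization proxy. It does not correspond to any specific published model, and the entropy model required for rate estimation is omitted.

```python
import torch
import torch.nn as nn

class TinyImageCodec(nn.Module):
    """Toy analysis/synthesis transforms for learned image compression.
    Quantization is approximated during training by additive uniform noise,
    a common device in end-to-end learned compression; at inference the
    latents are rounded as a hard quantizer."""
    def __init__(self, channels=128):
        super().__init__()
        self.analysis = nn.Sequential(
            nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 5, stride=2, padding=2),
        )
        self.synthesis = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 5, stride=2,
                               padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, channels, 5, stride=2,
                               padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 5, stride=2,
                               padding=2, output_padding=1),
        )

    def forward(self, x):
        y = self.analysis(x)
        if self.training:                              # quantization proxy
            y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)
        else:
            y_hat = torch.round(y)
        return self.synthesis(y_hat), y_hat            # reconstruction + latents
```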
Mining Larger Class Activation Map with Common Attribute Labels
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301872
Runtong Zhang, Fanman Meng, Hongliang Li, Q. Wu, K. Ngan
Class Activation Map (CAM) is the visualization of target regions generated from classification networks. However, a classification network trained with class-level labels has high responses to only a few features of objects, and thus cannot discriminate the whole target. We argue that the original labels used in classification tasks are not enough to describe all features of the objects. If we annotate more detailed labels, such as class-agnostic attribute labels, for each image, the network may be able to mine larger CAMs. Motivated by this idea, we propose and design common attribute labels, which are lower-level labels summarized from the original image-level categories to describe more details of the target. Moreover, it should be emphasized that the proposed labels generalize well to unknown categories, since attributes (such as head, body, etc.) are common and class-agnostic across categories (such as dog, cat, etc.). That is why we call the proposed labels common attribute labels: they are lower-level and more general than traditional labels. We complete the annotation work on the PASCAL VOC2012 dataset and design a new architecture to classify these common attribute labels. After fusing the features of the attribute labels into the original categories, our network can mine larger CAMs of objects. Our method achieves visually better CAM results and higher evaluation scores than traditional methods.
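For context, the baseline CAM that the paper aims to enlarge is obtained by weighting the final convolutional feature maps with the classifier weights of the chosen class; a minimal sketch is shown below (the attribute-label feature fusion proposed in the paper is not reproduced here).

```python
import torch

def class_activation_map(features, fc_weights, class_idx):
    """Standard CAM for one image.
    features:   (C, H, W) output of the last convolutional layer
    fc_weights: (num_classes, C) weights of the classifier after global pooling
    Returns an (H, W) map normalized to [0, 1] for visualization."""
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], features)
    cam = torch.relu(cam)                 # keep only positive evidence
    return cam / (cam.max() + 1e-8)
```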
IBC-Mirror Mode for Screen Content Coding for the Next Generation Video Coding Standards
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301811
Jian Cao, Zhen Qiu, Zhengren Li, Fan Liang, Jun Wang
This paper proposes an IBC-Mirror mode for Screen Content Coding (SCC) in the next-generation video coding standards, including Versatile Video Coding (VVC) and Audio Video Standard-3 in China (AVS3). This is the first time the mirror characteristic has been taken into consideration for SCC in VVC/AVS3. Based on the translational motion model of the Intra Block Copy (IBC) mode, the function of "horizontal and vertical flipping" is further added to reduce prediction error and improve coding efficiency. The proposed IBC-Mirror mode is implemented on the latest reference software, namely VTM-5.0 (VVC) and HPM-5.0 (AVS3). Simulations show that the proposed mode can achieve up to 1~2% (VVC) and 4~7% (AVS3) BD-rate savings for SCC test sequences. Drafts describing the mode have been submitted to the AVS meeting and investigated in the SCC Core Experiments (CE).
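The core of the mode can be sketched as an ordinary IBC block copy followed by optional horizontal/vertical mirroring of the reference block. The snippet below is purely illustrative and does not reflect the VTM/HPM integration, syntax signaling, or block-vector search.

```python
import numpy as np

def ibc_mirror_predict(recon, x, y, bv_x, bv_y, size, flip_h=False, flip_v=False):
    """Form a prediction block from already-reconstructed samples of the
    current picture. (x, y) is the top-left of the current block and
    (bv_x, bv_y) a block vector pointing into the reconstructed area;
    flip_h / flip_v apply the IBC-Mirror horizontal / vertical flipping."""
    ref = recon[y + bv_y:y + bv_y + size,
                x + bv_x:x + bv_x + size].copy()
    if flip_h:
        ref = ref[:, ::-1]   # horizontal mirror
    if flip_v:
        ref = ref[::-1, :]   # vertical mirror
    return ref
```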
DENESTO: A Tool for Video Decoding Energy Estimation and Visualization
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301877
Matthias Kränzler, Christian Herglotz, A. Kaup
Previous research has shown that the decoding energy demand of several video codecs can be estimated accurately using bit-stream feature-based models. In this paper, we show that visualization with the Decoding Energy Estimation Tool (DENESTO) can help improve the understanding of the decoder's energy demand.
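A bit-stream feature-based model of the kind referenced here typically estimates decoding energy as a weighted sum of feature occurrence counts extracted from the bit stream. The sketch below illustrates that idea with placeholder feature names and coefficients; it is not DENESTO's actual feature set or model.

```python
def estimate_decoding_energy(feature_counts, energy_per_feature):
    """Estimate decoding energy as a linear combination of bit-stream feature
    counts (e.g., numbers of coded CUs, transforms, filter operations) and
    per-feature energy coefficients obtained offline by regression."""
    return sum(feature_counts[name] * energy_per_feature[name]
               for name in feature_counts)

# Hypothetical example with made-up counts and coefficients (joules per use).
counts = {"intra_cu": 1200, "inter_cu": 5400, "inverse_transform": 8300}
coeffs = {"intra_cu": 4.1e-6, "inter_cu": 6.3e-6, "inverse_transform": 1.2e-6}
print(f"estimated decoding energy: {estimate_decoding_energy(counts, coeffs):.4f} J")
```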
Robust Visual Tracking Via An Imbalance-Elimination Mechanism
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301862
Jin Feng, Kaili Zhao, Xiaolin Song, Anxin Li, Honggang Zhang
Competitive performance in visual tracking is achieved mostly by tracking-by-detection approaches, whose accuracy relies heavily on a binary classifier that distinguishes targets from distractors in a set of candidates. However, severe class imbalance, with few positives (e.g., targets) relative to negatives (e.g., backgrounds), degrades classification accuracy or increases tracking bias. In this paper, we propose an imbalance-elimination mechanism that adopts a multi-class paradigm and utilizes a novel candidate generation strategy. Specifically, our multi-class model assigns samples to one positive class and four proposed negative classes, naturally alleviating class imbalance. We define the negative classes by introducing the proportion of the target contained in each sample, whose value explicitly reveals the relative scale between target and background. Furthermore, during candidate generation, we exploit these scale-aware negative patterns to adjust the search areas of candidates so that they incorporate larger target proportions; thus, more accurate target candidates are obtained and more positive samples are included, easing class imbalance at the same time. Extensive experiments on standard benchmarks show that our tracker achieves favorable performance against state-of-the-art approaches and offers robust discrimination between positive targets and negative patterns.
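The class-assignment idea can be illustrated by mapping each sampled candidate box to the positive class or to one of four negative classes according to the proportion of the target it contains. The proportion measure and thresholds below are hypothetical and are not the paper's exact definitions.

```python
def target_proportion(candidate, target):
    """Fraction of the candidate box covered by the target box.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    ix1, iy1 = max(candidate[0], target[0]), max(candidate[1], target[1])
    ix2, iy2 = min(candidate[2], target[2]), min(candidate[3], target[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (candidate[2] - candidate[0]) * (candidate[3] - candidate[1])
    return inter / area if area > 0 else 0.0

def assign_class(candidate, target, thresholds=(0.8, 0.5, 0.25, 0.05)):
    """Return 0 for the positive class (candidate dominated by the target) or
    1..4 for negative classes with decreasing target proportion."""
    p = target_proportion(candidate, target)
    if p >= thresholds[0]:
        return 0
    for cls, t in enumerate(thresholds[1:], start=1):
        if p >= t:
            return cls
    return 4   # background-dominated negative
```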
A review of data preprocessing modules in digital image forensics methods using deep learning
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301880
Alexandre Berthet, J. Dugelay
Access to technologies like mobile phones contributes to the significant increase in the volume of digital visual data (images and videos). In addition, photo editing software is becoming increasingly powerful and easy to use. In some cases, these tools can be used to produce forgeries with the objective of changing the semantic meaning of a photo or a video (e.g., fake news). Digital image forensics (DIF) has two main objectives: the detection (and localization) of forgery and the identification of the origin of the acquisition (i.e., sensor identification). Since 2005, many classical methods for DIF have been designed, implemented and tested on several databases. Meanwhile, innovative approaches based on deep learning have emerged in other fields and have surpassed traditional techniques. In the context of DIF, deep learning methods mainly use convolutional neural networks (CNNs) associated with significant preprocessing modules. This is an active domain, and two ways of applying preprocessing have been studied: prior to the network or incorporated into it. None of the existing studies on digital image forensics provides a comprehensive overview of the preprocessing techniques used with deep learning methods. Therefore, the core objective of this article is to review the preprocessing modules associated with CNN models.
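As an example of preprocessing applied prior to the network, many forensic pipelines suppress image content with a fixed high-pass filter and feed the noise residual to the CNN. The kernel below is one common, generic choice shown only for illustration; the methods surveyed in the paper use a variety of fixed or learned filters.

```python
import numpy as np
from scipy.signal import convolve2d

# A simple second-order high-pass kernel used here as a generic residual filter.
HIGH_PASS = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=np.float32)

def residual_preprocess(gray_image):
    """Suppress scene content and keep high-frequency residuals before the
    image is passed to a forensic CNN (preprocessing 'prior to the network')."""
    return convolve2d(gray_image.astype(np.float32), HIGH_PASS,
                      mode='same', boundary='symm')
```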
Attention-Guided Fusion Network of Point Cloud and Multiple Views for 3D Shape Recognition
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301813
Bo Peng, Zengrui Yu, Jianjun Lei, Jiahui Song
With the dramatic growth of 3D shape data, 3D shape recognition has become a hot research topic in the field of computer vision. How to effectively utilize the multimodal characteristics of 3D shapes is one of the key problems in boosting the performance of 3D shape recognition. In this paper, we propose a novel attention-guided fusion network of point cloud and multiple views for 3D shape recognition. Specifically, in order to obtain a more discriminative descriptor for 3D shape data, an inter-modality attention enhancement module and a view-context attention fusion module are proposed to gradually refine and fuse the features of the point cloud and the multiple views. In the inter-modality attention enhancement module, an inter-modality attention mask is computed from the joint feature representation, so that the features of each modality are enhanced by fusing the correlated information between the two modalities. After that, the view-context attention fusion module explores the context information of the multiple views and fuses the enhanced features to obtain a more discriminative descriptor for the 3D shape data. Experimental results on the ModelNet40 dataset demonstrate that the proposed method achieves promising performance compared with state-of-the-art methods.
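A minimal sketch of the inter-modality attention idea, assuming global point-cloud and view descriptors of equal dimension; the layer sizes and the exact enhancement rule are illustrative and do not follow the paper's architecture.

```python
import torch
import torch.nn as nn

class InterModalityAttention(nn.Module):
    """Predict an attention mask from the joint (concatenated) point-cloud and
    view descriptors, use it to enhance each modality with information from
    the other, and fuse the enhanced features into a single shape descriptor."""
    def __init__(self, dim=512):
        super().__init__()
        self.mask = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, point_feat, view_feat):
        joint = torch.cat([point_feat, view_feat], dim=-1)
        m = self.mask(joint)                        # inter-modality attention mask
        point_enh = point_feat + m * view_feat      # enhance point branch
        view_enh = view_feat + m * point_feat       # enhance view branch
        return self.fuse(torch.cat([point_enh, view_enh], dim=-1))
```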
Special Cane with Visual Odometry for Real-time Indoor Navigation of Blind People
Pub Date: 2020-12-01 | DOI: 10.1109/VCIP49819.2020.9301782
Tang Tang, Menghan Hu, Guodong Li, Qingli Li, Jian Zhang, Xiaofeng Zhou, Guangtao Zhai
Indoor navigation is urgently needed by blind people in their everyday lives. In this paper, we design an assistive cane with visual odometry, based on the actual requirements of the blind, to help them attain safe indoor navigation. Compared with state-of-the-art indoor navigation systems, the proposed device is portable, compact, and adaptable. The main specifications of the system are: the perception range is from 0.10m to 2.10m in width and from 0.08m to 1.60m in length; the maximum weight is 2.1kg; the detection range is from 0.15m to 3.00m; the cruising time is about 8h; and objects whose heights are below 80cm can be detected. A demo video of the proposed navigation system is available at: https://doi.org/10.6084/m9.figshare.12399572.v1.