Decoder Derived Cross-Component Linear Model Intra-Prediction for Video Coding
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506173
Z. Deng, Kai Zhang, Li Zhang
This paper presents a decoder-derived cross-component linear model (DD-CCLM) intra-prediction method, in which one or more linear models can be used to exploit the similarities between luma and chroma sample values, and the number of linear models used for a specific coding unit is adaptively determined at both the encoder and decoder sides in a consistent way, without signalling a syntax element. The neighbouring samples are classified into two or three groups based on a K-means algorithm. Moreover, DD-CCLM can be combined with normal intra-prediction modes such as the DM mode. The proposed method integrates well with the state-of-the-art CCLM intra-prediction in the Versatile Video Coding (VVC) standard. Experimental results show that the proposed method provides an overall average bitrate saving of 0.52% for the All Intra configuration under the JVET common test conditions, with negligible runtime change. On sequences with rich chroma information, the coding gain is up to 2.07%.
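To make the multi-model idea concrete, here is a minimal floating-point sketch in the spirit of the derivation the abstract describes: the neighbouring reconstructed samples are clustered by luma value with a simple K-means, one linear model is fitted per cluster, and each chroma sample is predicted with the model of its nearest luma centroid. This is not the bit-exact, integer derivation used in VVC or in DD-CCLM; the function names and the least-squares fitting are illustrative assumptions.

```python
import numpy as np

def kmeans_1d(values, k, iters=10):
    # Tiny 1-D K-means over neighbouring luma sample values.
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = values[labels == c].mean()
    return centroids, labels

def derive_linear_models(neigh_luma, neigh_chroma, k):
    # Fit one (alpha, beta) pair per cluster of neighbouring samples
    # (luma assumed already downsampled to the chroma grid).
    centroids, labels = kmeans_1d(neigh_luma, k)
    models = []
    for c in range(k):
        x, y = neigh_luma[labels == c], neigh_chroma[labels == c]
        if len(x) >= 2 and x.var() > 0:
            alpha = np.cov(x, y, bias=True)[0, 1] / x.var()
            beta = y.mean() - alpha * x.mean()
        else:
            alpha, beta = 0.0, (y.mean() if len(y) else 0.0)
        models.append((alpha, beta))
    return centroids, models

def predict_chroma(rec_luma_block, centroids, models):
    # Each chroma sample uses the model of its nearest luma centroid.
    idx = np.argmin(np.abs(rec_luma_block[..., None] - centroids), axis=-1)
    alphas = np.array([m[0] for m in models])[idx]
    betas = np.array([m[1] for m in models])[idx]
    return alphas * rec_luma_block + betas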
{"title":"Decoder Derived Cross-Component Linear Model Intra-Prediction for Video Coding","authors":"Z. Deng, Kai Zhang, Li Zhang","doi":"10.1109/ICIP42928.2021.9506173","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506173","url":null,"abstract":"This paper presents a decoder derived cross-component linear model (DD-CCLM) intra-prediction method, in which one or more linear models can be used to exploit the similarities between luma and chroma sample values, and the number of linear models used for a specific coding unit is adaptively determined at both encoder and decoder sides in a consistent way, without signalling a syntax element. The neighbouring samples are classified into two or three groups based on a K-means algorithm. Moreover, DDCCLM can be combined with normal intra-prediction modes such as DM mode. The proposed method can be well incorporated with the state-of-the-art CCLM intra-prediction in the Versatile Video Coding standard. Experimental results show that the proposed method provides an overall average bitrate saving of 0.52% for All Intra configurations under the JVET common test conditions, with negligible runtime change. On sequences with rich chroma information, the coding gain is up to 2.07%.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115241604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WarpingFusion: Accurate Multi-View TSDF Fusion with Local Perspective Warp
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506166
Jiwoo Kang, Seongmin Lee, Mingyu Jang, H. Yoon, Sanghoon Lee
In this paper, we propose a novel 3D reconstruction framework in which the surface of a target object is reconstructed accurately and robustly from multi-view depth maps. A depth map of a moving object tends to exhibit spatially varying perspective warps due to motion blur and rolling-shutter artifacts. Incorporating these misaligned points from the views into the world coordinate frame leads to significant artifacts in the reconstructed shape. We address the mismatches with a patch-based depth-to-surface alignment that uses an implicit surface-based distance measure. The patch-based minimization finds spatial warps on the depth map quickly and accurately while preserving the global transformation. The proposed framework efficiently optimizes the local alignments against depth occlusions and local variations, thanks to the point-to-surface distance based on an implicit representation. The proposed method shows significant improvements over other reconstruction methods, demonstrating the efficiency and benefits of our method in multi-view reconstruction.
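As a hedged illustration of why an implicit representation makes the point-to-surface distance cheap to evaluate, the sketch below reads an approximate signed distance for a set of 3D points directly out of a TSDF volume. It uses a nearest-voxel lookup for brevity (trilinear interpolation would normally be used), and the function name and arguments are assumptions rather than the paper's API.

```python
import numpy as np

def point_to_surface_distance(tsdf, origin, voxel_size, trunc_dist, points):
    # tsdf: (X, Y, Z) volume of truncated signed distances normalized to [-1, 1];
    # origin: world position of voxel (0, 0, 0); points: (N, 3) world coordinates.
    idx = np.round((np.asarray(points) - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(tsdf.shape) - 1)
    # De-normalize; valid as a distance only inside the truncation band.
    return tsdf[idx[:, 0], idx[:, 1], idx[:, 2]] * trunc_dist
```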
{"title":"WarpingFusion: Accurate Multi-View TSDF Fusion with Local Perspective Warp","authors":"Jiwoo Kang, Seongmin Lee, Mingyu Jang, H. Yoon, Sanghoon Lee","doi":"10.1109/ICIP42928.2021.9506166","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506166","url":null,"abstract":"In this paper, we propose the novel 3D reconstruction framework, where the surface of a target object is reconstructed accurately and robustly from multi-view depth maps. A depth map of a moving object tends to have the spatially-varying perspective warps due to motion blur and rolling shutter artifacts. Incorporating those misaligned points from the views into the world coordinate leads to significant artifacts in the reconstructed shape. We address the mismatches by the patch-based depth-to-surface alignment using implicit surface-based distance measurement. The patch-based minimization finds spatial warps on the depth map fast and accurately with the global transformation preserved. The proposed framework efficiently optimizes the local alignments against depth occlusions and local variants thanks to the point to surface distance based on an implicit representation. The proposed method shows significant improvements over the other reconstruction methods, demonstrating efficiency and benefits of our method in the multi-view reconstruction.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114679695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Plug-And-Play Image Reconstruction Meets Stochastic Variance-Reduced Gradient Methods
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506021
Vincent Monardo, A. Iyer, S. Donegan, M. Graef, Yuejie Chi
Plug-and-play (PnP) methods have recently emerged as a powerful framework for image reconstruction that can flexibly combine different physics-based observation models with data-driven image priors in the form of denoisers, and achieve state-of-the-art image reconstruction quality in many applications. In this paper, we aim to further improve the computational efficiency of PnP methods by designing a new algorithm that makes use of stochastic variance-reduced gradients (SVRG), a recent idea for accelerating stochastic optimization. Compared with existing PnP methods using batch gradients or stochastic gradients, the new algorithm, called PnP-SVRG, achieves comparable or better image reconstruction accuracy at a much faster computational speed. Extensive numerical experiments demonstrate the benefits of the proposed algorithm on compressive imaging from partial Fourier measurements, in conjunction with a wide variety of popular image denoisers.
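Below is a minimal sketch of the iteration structure described above, assuming a linear measurement model y ≈ A x with a least-squares data-fidelity term: the SVRG estimator combines a mini-batch gradient at the current iterate, the same mini-batch gradient at a periodically refreshed anchor point, and the full gradient at that anchor, and the denoiser is applied where a proximal step would otherwise be. The step size, epoch counts, and the `denoise` callable are placeholders, not values from the paper.

```python
import numpy as np

def pnp_svrg(A, y, denoise, x0, step=1e-3, epochs=10, inner=20, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    x = x0.copy()
    for _ in range(epochs):
        x_anchor = x.copy()
        full_grad = A.T @ (A @ x_anchor - y) / m          # full gradient at the anchor
        for _ in range(inner):
            idx = rng.choice(m, size=batch, replace=False)
            Ab, yb = A[idx], y[idx]
            g = (Ab.T @ (Ab @ x - yb) - Ab.T @ (Ab @ x_anchor - yb)) / batch + full_grad
            x = denoise(x - step * g)                      # denoiser replaces the prox step
    return x
```

With `inner = 1` and `batch = m` the estimator reduces to a plain batch-gradient PnP step, which is one of the baselines the abstract compares against.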
{"title":"Plug-And-Play Image Reconstruction Meets Stochastic Variance-Reduced Gradient Methods","authors":"Vincent Monardo, A. Iyer, S. Donegan, M. Graef, Yuejie Chi","doi":"10.1109/ICIP42928.2021.9506021","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506021","url":null,"abstract":"Plug-and-play (PnP) methods have recently emerged as a powerful framework for image reconstruction that can flexibly combine different physics-based observation models with data-driven image priors in the form of denoisers, and achieve state-of-the-art image reconstruction quality in many applications. In this paper, we aim to further improve the computational efficacy of PnP methods by designing a new algorithm that makes use of stochastic variance-reduced gradients (SVRG), a nascent idea to accelerate runtime in stochastic optimization. Compared with existing PnP methods using batch gradients or stochastic gradients, the new algorithm, called PnP-SVRG, achieves comparable or better accuracy of image reconstruction at a much faster computational speed. Extensive numerical experiments are provided to demonstrate the benefits of the proposed algorithm through the application of compressive imaging using partial Fourier measurements in conjunction with a wide variety of popular image denoisers.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116866553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving Fourier Phase Retrieval with a Reference Image as a Sequence of Linear Inverse Problems
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506095
M. Salman Asif
The Fourier phase retrieval problem is equivalent to recovering a two-dimensional image from its autocorrelation measurements. This problem is generally nonlinear and nonconvex. Good initialization and prior information about the support or sparsity of the target image are often critical for robust recovery. In this paper, we show that the presence of a known reference image lets us solve the nonlinear phase retrieval problem as a sequence of small linear inverse problems. Instead of recovering the entire image at once, our sequential method recovers a small number of rows or columns by solving a linear deconvolution problem at every step. Existing methods for reference-based (holographic) phase retrieval either assume that the reference and target images are sufficiently separated so that the recovery problem is linear, or recover the image via nonlinear optimization. In contrast, our proposed method does not require the separation condition. We performed an extensive set of simulations to demonstrate that our proposed method can successfully recover images from autocorrelation data under different settings of reference placement and noise.
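For readers unfamiliar with the setting, the toy sketch below shows the holographic measurement model the abstract refers to: the target and a known reference are placed in one oversampled frame, the observed data are squared Fourier magnitudes, and the autocorrelation is obtained from them directly by an inverse FFT. The sequential row-by-row (or column-by-column) deconvolution that constitutes the paper's contribution is not reproduced here; the layout and names are illustrative assumptions.

```python
import numpy as np

def holographic_measurements(target, reference):
    # Place the target and the known reference side by side, zero-pad so the
    # full linear autocorrelation is captured, and record Fourier magnitudes.
    combined = np.concatenate([target, reference], axis=1)
    h, w = combined.shape
    padded = np.zeros((2 * h, 2 * w))
    padded[:h, :w] = combined
    magnitudes = np.abs(np.fft.fft2(padded)) ** 2       # the only observed data
    autocorr = np.real(np.fft.ifft2(magnitudes))        # computable from the data alone
    return magnitudes, autocorr
```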
{"title":"Solving Fourier Phase Retrieval with a Reference Image as a Sequence of Linear Inverse Problems","authors":"M. Salman Asif","doi":"10.1109/ICIP42928.2021.9506095","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506095","url":null,"abstract":"Fourier phase retrieval problem is equivalent to the recovery of a two-dimensional image from its autocorrelation measurements. This problem is generally nonlinear and nonconvex. Good initialization and prior information about the support or sparsity of the target image are often critical for a robust recovery. In this paper, we show that the presence of a known reference image can help us solve the nonlinear phase retrieval problem as a sequence of small linear inverse problems. Instead of recovering the entire image at once, our sequential method recovers a small number of rows or columns by solving a linear deconvolution problem at every step. Existing methods for the reference-based (holographic) phase retrieval either assume that the reference and target images are sufficiently separated so that the recovery problem is linear or recover the image via nonlinear optimization. In contrast, our proposed method does not require the separation condition. We performed an extensive set of simulations to demonstrate that our proposed method can successfully recover images from autocorrelation data under different settings of reference placement and noise.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"16 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120903571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Temporal Statistics Model for UGC Video Quality Prediction
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506669
Zhengzhong Tu, Chia-Ju Chen, Yilin Wang, N. Birkbeck, Balu Adsumilli, A. Bovik
Blind video quality assessment of user-generated content (UGC) has become a trending and challenging problem. Previous studies have shown the efficacy of natural scene statistics for capturing spatial distortions; the exploration of temporal video statistics on UGC, however, is relatively limited. Here we propose the first general, effective, and efficient temporal statistics model that accounts for temporal- or motion-related distortions in UGC video quality assessment, by analyzing regularities in the temporal bandpass domain. The proposed temporal model can serve as a plug-in module to boost existing no-reference video quality predictors that lack motion-relevant features. Our experimental results on recent large-scale UGC video databases show that the proposed model can significantly improve the performance of existing methods, at a very reasonable computational expense.
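To give a rough sense of what "regularities in the temporal bandpass domain" can look like in code, the sketch below takes successive frame differences as the simplest temporal bandpass signal and summarizes each difference frame with moment-matched generalized-Gaussian parameters. This is a generic natural-scene-statistics recipe, not the specific feature set of the paper; the names and the 0.2 to 10 shape grid are assumptions.

```python
import numpy as np
from scipy.special import gamma

def ggd_params(coeffs):
    # Moment-matching estimate of generalized-Gaussian shape and scale.
    coeffs = coeffs.ravel()
    sigma_sq = np.mean(coeffs ** 2)
    rho = sigma_sq / (np.mean(np.abs(coeffs)) ** 2 + 1e-12)
    shapes = np.arange(0.2, 10.0, 0.001)
    ratios = gamma(1.0 / shapes) * gamma(3.0 / shapes) / gamma(2.0 / shapes) ** 2
    alpha = shapes[np.argmin((ratios - rho) ** 2)]
    return alpha, np.sqrt(sigma_sq)

def temporal_bandpass_features(frames):
    # Successive frame differences act as the simplest temporal bandpass filter.
    diffs = np.diff(np.asarray(frames, dtype=np.float64), axis=0)
    return [ggd_params(d) for d in diffs]
```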
{"title":"A Temporal Statistics Model For UGC Video Quality Prediction","authors":"Zhengzhong Tu, Chia-Ju Chen, Yilin Wang, N. Birkbeck, Balu Adsumilli, A. Bovik","doi":"10.1109/ICIP42928.2021.9506669","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506669","url":null,"abstract":"Blind video quality assessment of user-generated content (UGC) has become a trending and challenging problem. Previous studies have shown the efficacy of natural scene statistics for capturing spatial distortions. The exploration of temporal video statistics on UGC, however, is relatively limited. Here we propose the first general, effective and efficient temporal statistics model accounting for temporal- or motion-related distortions for UGC video quality assessment, by analyzing regularities in the temporal bandpass domain. The proposed temporal model can serve as a plug-in module to boost existing no-reference video quality predictors that lack motion-relevant features. Our experimental results on recent large-scale UGC video databases show that the proposed model can significantly improve the performances of existing methods, at a very reasonable computational expense.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127077205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Violence Detection from Video under 2D Spatio-Temporal Representations
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506142
Mohamed Chelali, Camille Kurtz, N. Vincent
Action recognition in videos, especially violence detection, is now a hot topic in computer vision. Interest in this task stems from the proliferation of videos from surveillance cameras and live television content, which produce complex 2D+t data. State-of-the-art methods rely on end-to-end learning with 3D neural networks that must be trained on large amounts of data to obtain discriminating features. To address these limitations, we present in this article a method for classifying videos for violence recognition using a classical 2D convolutional neural network (CNN). The strategy of the method is two-fold: (1) we start by building several 2D spatio-temporal representations from an input video; (2) these representations are then fed to the CNN for the train/test process. The classification decision for a video is obtained by aggregating the individual decisions from its different 2D spatio-temporal representations. An experimental study on public datasets containing violent videos highlights the interest of the presented method.
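The sketch below builds one very simple 2D spatio-temporal representation, purely to illustrate the idea of flattening the time axis into an image a 2D CNN can consume: the same pixel row is taken from every frame and the rows are stacked over time. The paper's actual representations differ; this example and its names are assumptions.

```python
import numpy as np

def temporal_slice_image(frames, row=None):
    # frames: (T, H, W) grayscale clip; output: (T, W) image with one
    # spatial axis and one temporal axis, usable as 2D CNN input.
    frames = np.asarray(frames)
    if row is None:
        row = frames.shape[1] // 2
    return frames[:, row, :]
```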
{"title":"Violence Detection from Video under 2D Spatio-Temporal Representations","authors":"Mohamed Chelali, Camille Kurtz, N. Vincent","doi":"10.1109/ICIP42928.2021.9506142","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506142","url":null,"abstract":"Action recognition in videos, especially for violence detection, is now a hot topic in computer vision. The interest of this task is related to the multiplication of videos from surveillance cameras or live television content producing complex $2D+t$ data. State-of-the-art methods rely on end-to-end learning from 3D neural network approaches that should be trained with a large amount of data to obtain discriminating features. To face these limitations, we present in this article a method to classify videos for violence recognition purpose, by using a classical 2D convolutional neural network (CNN). The strategy of the method is two-fold: (1) we start by building several 2D spatio-temporal representations from an input video, (2) the new representations are considered to feed the CNN to the train/test process. The classification decision of the video is carried out by aggregating the individual decisions from its different 2D spatio-temporal representations. An experimental study on public datasets containing violent videos highlights the interest of the presented method.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126089843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GIID-Net: Generalizable Image Inpainting Detection Network
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506778
Haiwei Wu, Jiantao Zhou
Deep learning (DL) has demonstrated powerful capabilities in the field of image inpainting and can produce visually plausible results. Meanwhile, the malicious use of advanced image inpainting tools (e.g., removing key objects to report fake news) has led to increasing threats to the reliability of image data. To fight against inpainting forgeries, in this work we propose a novel end-to-end Generalizable Image Inpainting Detection Network (GIID-Net) to detect inpainted regions at pixel-level accuracy. Extensive experimental results are presented to validate the superiority of the proposed GIID-Net compared with state-of-the-art competitors. Our results suggest that common artifacts are shared across diverse image inpainting methods.
{"title":"GIID-NET: Generalizable Image Inpainting Detection Network","authors":"Haiwei Wu, Jiantao Zhou","doi":"10.1109/ICIP42928.2021.9506778","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506778","url":null,"abstract":"Deep learning (DL) has demonstrated its powerful capabilities in the field of image inpainting, which could produce visually plausible results. Meanwhile, the malicious use of advanced image inpainting tools (e.g. removing key objects to report fake news) has led to increasing threats to the reliability of image data. To fight against the inpainting forgeries, in this work, we propose a novel end-to-end Generalizable Image Inpainting Detection Network (GIID-Net), to detect the inpainted regions at pixel accuracy. Extensive experimental results are presented to validate the superiority of the proposed GIID-Net, compared with the state-of-the-art competitors. Our results would suggest that common artifacts are shared across diverse image inpainting methods.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125423805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lightweight Connectivity in Graph Convolutional Networks for Skeleton-Based Recognition
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506774
H. Sahbi
Graph convolutional networks (GCNs) aim to extend deep learning to arbitrary irregular domains, namely graphs. Their success is highly dependent on how the topology of the input graphs is defined, and most existing GCN architectures rely on predefined or handcrafted graph structures. In this paper, we introduce a novel method that learns the topology (or connectivity) of input graphs as part of GCN design. The main contribution of our method resides in building an orthogonal connectivity basis that optimally aggregates nodes, through their neighborhoods, prior to convolution. Our method also considers a stochasticity criterion that acts as a regularizer, making the learned basis and the underlying GCNs lightweight while remaining highly effective. Experiments conducted on the challenging task of skeleton-based hand-gesture recognition show the high effectiveness of the learned GCNs compared with related work.
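As a hedged, generic illustration of learning the connectivity jointly with the network (not the orthogonal-basis construction or the stochasticity regularizer of this paper), the layer below keeps the adjacency as a trainable parameter that is row-normalized before node aggregation; the class name and shapes are assumptions.

```python
import torch
import torch.nn as nn

class LearnedAdjacencyGCNLayer(nn.Module):
    """Toy GCN layer whose node connectivity is itself a trainable parameter."""

    def __init__(self, num_nodes, in_dim, out_dim):
        super().__init__()
        self.adj_logits = nn.Parameter(torch.zeros(num_nodes, num_nodes))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                  # x: (batch, num_nodes, in_dim)
        adj = torch.softmax(self.adj_logits, dim=-1)       # learned, row-normalized connectivity
        return torch.relu(self.linear(adj @ x))            # aggregate neighbours, then transform
```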
{"title":"Lightweight Connectivity In Graph Convolutional Networks For Skeleton-Based Recognition","authors":"H. Sahbi","doi":"10.1109/ICIP42928.2021.9506774","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506774","url":null,"abstract":"Graph convolutional networks (GCNs) aim at extending deep learning to arbitrary irregular domains, namely graphs. Their success is highly dependent on how the topology of input graphs is defined and most of the existing GCN architectures rely on predefined or handcrafted graph structures. In this paper, we introduce a novel method that learns the topology (or connectivity) of input graphs as a part of GCN design. The main contribution of our method resides in building an orthogonal connectivity basis that optimally aggregates nodes, through their neighborhood, prior to achieve convolution. Our method also considers a stochasticity criterion which acts as a regularizer that makes the learned basis and the underlying GCNs lightweight while still being highly effective. Experiments conducted on the challenging task of skeleton-based hand-gesture recognition show the high effectiveness of the learned GCNs w.r.t. the related work.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115015014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Fast Smart-Cropping Method and Dataset for Video Retargeting
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506390
Konstantinos Apostolidis, V. Mezaris
In this paper, a method that re-targets a video to a different aspect ratio using cropping is presented. We argue that cropping methods are more suitable for video aspect-ratio transformation when minimizing semantic distortions is a prerequisite. Our method uses visual saliency to find the image regions that attract attention, and employs a filtering-through-clustering technique to select the main region of focus. We additionally introduce the first publicly available benchmark dataset for video cropping, annotated by 6 human subjects. Experimental evaluation on the introduced dataset shows the competitiveness of our method.
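Here is a minimal sketch of the saliency-driven crop selection step, assuming the target is narrower than the source and that a single full-height window is chosen per frame by maximizing the saliency it contains; the filtering-through-clustering stage of the paper is not reproduced, and the names are assumptions.

```python
import numpy as np

def best_crop_x(saliency, target_w):
    # Offset of the full-height window of width target_w containing most saliency.
    col_mass = saliency.sum(axis=0)
    window_mass = np.convolve(col_mass, np.ones(target_w), mode="valid")
    return int(np.argmax(window_mass))

def crop_frame(frame, saliency, target_aspect):
    # target_aspect = desired width / height of the output crop.
    h, w = saliency.shape
    target_w = min(w, int(round(h * target_aspect)))
    x0 = best_crop_x(saliency, target_w)
    return frame[:, x0:x0 + target_w]
```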
{"title":"A Fast Smart-Cropping Method and Dataset for Video Retargeting","authors":"Konstantinos Apostolidis, V. Mezaris","doi":"10.1109/ICIP42928.2021.9506390","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506390","url":null,"abstract":"In this paper a method that re-targets a video to a different aspect ratio using cropping is presented. We argue that cropping methods are more suitable for video aspect ratio transformation when the minimization of semantic distortions is a prerequisite. For our method, we utilize visual saliency to find the image regions of attention, and we employ a filtering-through-clustering technique to select the main region of focus. We additionally introduce the first publicly available benchmark dataset for video cropping, annotated by 6 human subjects. Experimental evaluation on the introduced dataset shows the competitiveness of our method.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116436203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection
Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506249
Thang Vu, Kookhoi Kim, Haeyong Kang, Xuan Thanh Nguyen, T. Luu, C. Yoo
A bounding box commonly serves as the proxy for 2D object detection. However, extending this practice to 3D detection increases sensitivity to localization error. The problem is acute for flat objects, since a small localization error may lead to low overlap between the prediction and the ground truth. To address this problem, this paper proposes the Sphere Region Proposal Network (SphereRPN), which detects objects by learning spheres instead of bounding boxes. We demonstrate that spherical proposals are more robust to localization error than bounding boxes. The proposed SphereRPN is not only accurate but also fast. Experimental results on the standard ScanNet dataset show that the proposed SphereRPN outperforms previous state-of-the-art methods by a large margin while being 2× to 7× faster. The code will be made publicly available.
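To illustrate how spherical proposals could be matched against ground truth in place of box IoU, here is a sketch of an exact sphere-sphere IoU using the standard two-sphere intersection volume; the paper may use a different matching criterion, and the function name is an assumption.

```python
import numpy as np

def sphere_iou(c1, r1, c2, r2):
    # Exact IoU of two spheres given centers (3-vectors) and radii.
    d = float(np.linalg.norm(np.asarray(c1, float) - np.asarray(c2, float)))
    v1 = 4.0 / 3.0 * np.pi * r1 ** 3
    v2 = 4.0 / 3.0 * np.pi * r2 ** 3
    if d >= r1 + r2:                       # disjoint spheres
        inter = 0.0
    elif d <= abs(r1 - r2):                # one sphere contains the other
        inter = min(v1, v2)
    else:                                  # lens-shaped intersection of two caps
        inter = (np.pi * (r1 + r2 - d) ** 2
                 * (d ** 2 + 2 * d * (r1 + r2) - 3 * (r1 - r2) ** 2)) / (12 * d)
    return inter / (v1 + v2 - inter)
```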
{"title":"Sphererpn: Learning Spheres For High-Quality Region Proposals On 3d Point Clouds Object Detection","authors":"Thang Vu, Kookhoi Kim, Haeyong Kang, Xuan Thanh Nguyen, T. Luu, C. Yoo","doi":"10.1109/ICIP42928.2021.9506249","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506249","url":null,"abstract":"A bounding box commonly serves as the proxy for 2D object detection. However, extending this practice to 3D detection raises sensitivity to localization error. This problem is acute on flat objects since small localization error may lead to low overlaps between the prediction and ground truth. To address this problem, this paper proposes Sphere Region Proposal Network (SphereRPN) which detects objects by learning spheres as opposed to bounding boxes. We demonstrate that spherical proposals are more robust to localization error compared to bounding boxes. The proposed SphereRPN is not only accurate but also fast. Experiment results on the standard ScanNet dataset show that the proposed SphereRPN outperforms the previous state-of-the-art methods by a large margin while being $2 times$ to $7 times$ faster. The code will be made publicly available.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122296251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}