Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190991
Xiannong Wu, Chi Zhang, Yuehu Liu
Precise and online LiDAR-camera extrinsic calibration is a prerequisite of multi-modal data fusion for autonomous perception. Existing 6-DoF pose regression networks devote most of their effort to coarse-to-fine training strategies that gradually approach the global minimum; with limited computing resources, however, the optimal pose parameters remain out of reach. Moreover, recent research on neural network interpretability suggests that learning-based pose regression is essentially interpolation among the most relevant training samples. Motivated by this observation, we propose to solve the calibration problem as a retrieval task. Concretely, a learning-to-rank pipeline is introduced to rank the top-n most relevant poses in the gallery set, which are then fused into the final prediction. To better capture the pose relevance between ground-truth samples, we further propose an exponential mapping from the parametric space to the relevance space. The superiority of the proposed method is demonstrated in comparative and ablative experiments.
{"title":"Calibrank: Effective Lidar-Camera Extrinsic Calibration By Multi-Modal Learning To Rank","authors":"Xiannong Wu, Chi Zhang, Yuehu Liu","doi":"10.1109/ICIP40778.2020.9190991","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190991","url":null,"abstract":"Precise and online LiDAR-camera extrinsic calibration is one of the prerequisites of multi-modal data fusion for autonomous perception. The existing 6-DoF pose regression networks take majority effort on coarse-to-fine training strategy to gradually approach the global minimum. However, with limited computing resources, the optimal pose parameters seem unreachable. Moreover, recent research on neural network interpretability reveals that learning-based pose regression is nothing but the interpolation with most relevant samples. Motivated by this notion, we propose to solve the calibration problem in a retrieval way. Concretely, the learning-to-rank pipeline is introduced for ranking the top n relevant poses in the gallery set, which is then fused in to the final prediction. To better explore the pose relevance between ground truth samples, we further propose an exponential mapping from parametric space to the relevance space. The superiority of the proposed method is validated and demonstrated in the comparative and ablative experimental analysis.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122123412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190915
Meng Lei, Falei Luo, Xinfeng Zhang, Shanshe Wang, Siwei Ma
In traditional intra prediction, the nearest reference samples are used to generate the prediction block. Although more directional intra modes and reference lines have been introduced, encoders still cannot efficiently predict complex content from local reference samples alone. To address this issue, a two-step progressive prediction method combining local and non-local information is proposed. The non-local information is obtained through template-matching-based prediction, and the local information is derived from the high-frequency coefficients of the first prediction step. Experimental results show that the proposed method achieves a 0.87% BD-rate reduction in VTM-7.0. In particular, it offers significant advantages over prediction schemes that use only non-local information.
{"title":"Two-Step Progressive Intra Prediction For Versatile Video Coding","authors":"Meng Lei, Falei Luo, Xinfeng Zhang, Shanshe Wang, Siwei Ma","doi":"10.1109/ICIP40778.2020.9190915","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190915","url":null,"abstract":"In traditional intra prediction, nearest reference samples are utilized to generate the prediction block. Although more directional intra modes and reference lines have been utilized, encoders could not predict complex content with only the 10-cal reference samples efficiently. To address this issue, a twostep progressive prediction method combining local and nonlocal information is proposed. The non-local information can be obtained through template matching based prediction, and the local information can be derived by the high frequency coefficients from the first prediction step. Experimental results show that the proposed method can achieve 0.87% BD-rate reduction in VTM-7.0. In particular, the method is of significant advantages over prediction schemes using only non-local information.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117087037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191343
Yunpeng Tang, Xiaobo Shen, Zexuan Ji, Tao Wang, Peng Fu, Quansen Sun
Multi-view hashing efficiently integrates multi-view data for learning compact hash codes, and achieves impressive large-scale retrieval performance. In real-world applications, multi-view data are often stored or collected in different locations, where hash code learning is more challenging yet less studied. To fill this gap, this paper proposes a novel supervised multi-view distributed hashing (SMvDisH) method for learning hash codes from multi-view data in a distributed manner. SMvDisH yields discriminative latent hash codes by jointly learning a latent factor model and a classifier. Under a local consistency assumption among neighboring nodes, the distributed learning problem is divided into a set of decentralized sub-problems that can be solved in parallel with low computational and communication costs. Experimental results on three large-scale image datasets demonstrate that SMvDisH achieves competitive retrieval performance and trains faster than state-of-the-art multi-view hashing methods.
{"title":"Supervised Multi-View Distributed Hashing","authors":"Yunpeng Tang, Xiaobo Shen, Zexuan Ji, Tao Wang, Peng Fu, Quansen Sun","doi":"10.1109/ICIP40778.2020.9191343","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191343","url":null,"abstract":"Multi-view hashing efficiently integrates multi-view data for learning compact hash codes, and achieves impressive large-scale retrieval performance. In real-world applications, multi-view data are often stored or collected in different locations, where hash code learning is more challenging yet less studied. To fulfill this gap, this paper proposes a novel supervised multi-view distributed hashing (SMvDisH) for hash code learning from multi-view data in a distributed manner. SMvDisH yields the discriminative latent hash codes by joint learning of latent factor model and classifier. With local consistency assumption among neighbor nodes, the distributed learning problem is divided into a set of decentralized sub-problems. The sub-problems can be solved in parallel, and the computational and communication costs are low. Experimental results on three large-scale image datasets demonstrate that SMvDisH achieves competitive retrieval performance and trains faster than state-of-the-art multi-view hashing methods.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123901144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191112
Jianjun Lei, Zongqian Zhang, Dong Liu, Ying Chen, N. Ling
Multiview video involves a large amount of data, which poses great challenges for both storage and transmission; it is therefore essential to increase the compression efficiency of multiview video coding. In this paper, a deep virtual reference frame generation method is proposed to improve the performance of multiview video coding. Specifically, a parallax-guided generation network (PGG-Net) is designed to transform the parallax relation between different viewpoints and generate a high-quality virtual reference frame. In the network, a multi-level receptive field module is designed to enlarge the receptive field and extract multi-scale deep features. After that, a parallax attention fusion module is used to transform the parallax and merge the features. The proposed method is integrated into the 3D-HEVC platform, and the generated virtual reference frame is inserted into the reference picture list as an additional reference. Experimental results show that the proposed method achieves a 5.31% average BD-rate reduction compared to 3D-HEVC.
{"title":"Deep Virtual Reference Frame Generation For Multiview Video Coding","authors":"Jianjun Lei, Zongqian Zhang, Dong Liu, Ying Chen, N. Ling","doi":"10.1109/ICIP40778.2020.9191112","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191112","url":null,"abstract":"Multiview video has a large amount of data which brings great challenges to both the storage and transmission. Thus, it is essential to increase the compression efficiency of multiview video coding. In this paper, a deep virtual reference frame generation method is proposed to improve the performance of multiview video coding. Specifically, a parallax-guided generation network (PGG-Net) is designed to transform the parallax relation between different viewpoints and generate a high-quality virtual reference frame. In the network, a multilevel receptive field module is designed to enlarge the receptive field and extract the multi-scale deep features. After that, a parallax attention fusion module is used to transform the parallax and merge the features. The proposed method is integrated into the platform of 3D-HEVC and the generated virtual reference frame is inserted into the reference picture list as an additional reference. Experimental results show that the proposed method achieves 5.31% average BD-rate reduction compared to the 3D-HEVC.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"653 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123971409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190771
Zhuoqian Yang, Zengchang Qin, Jing Yu, T. Wan
Visual Question Answering (VQA) is a representative cross-modal reasoning task in which an image and a free-form natural-language question are presented, and the correct answer must be determined using both visual and textual information. One of the key issues in VQA is reasoning over semantic clues in the visual content under the guidance of the question. In this paper, we propose the Scene Graph Convolutional Network (SceneGCN) to jointly reason about object properties and their semantic relations for the correct answer. The visual relationships are projected into a deep learned semantic space constrained by visual context and language priors. Comprehensive experiments on two challenging datasets, GQA and VQA 2.0, demonstrate the effectiveness and interpretability of the new model.
{"title":"Prior Visual Relationship Reasoning For Visual Question Answering","authors":"Zhuoqian Yang, Zengchang Qin, Jing Yu, T. Wan","doi":"10.1109/ICIP40778.2020.9190771","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190771","url":null,"abstract":"Visual Question Answering (VQA) is a representative task of cross-modal reasoning where an image and a free-form question in natural language are presented and the correct answer needs to be determined using both visual and textual information. One of the key issues of VQA is to reason with semantic clues in the visual content under the guidance of the question. In this paper, we propose Scene Graph Convolutional Network (SceneGCN) to jointly reason the object properties and their semantic relations for the correct answer. The visual relationship is projected into a deep learned semantic space constrained by visual context and language priors. Based on comprehensive experiments on two challenging datasets: GQA and VQA 2.0, we demonstrate the effectiveness and interpretability of the new model.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124069852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191336
Jiaqiyu Zhan, Yuesheng Zhu, Zhiqiang Bai
Methods based on sparse subspace clustering (SSC) have shown great potential for hyperspectral image (HSI) clustering. However, their performance is limited by the complex spatial-spectral structure of HSIs. In this paper, a spatial best-fit direction (SBFD) algorithm is proposed to update the coefficients obtained from sparse representation into more discriminative features by integrating the spatial-contextual information given by the best-fit pixel of each target pixel. SBFD is also more targeted, since it searches for the best-fit direction rather than directly max-pooling over a local window. The proposed SBFD was tested on two widely used hyperspectral datasets, and the experimental results show improvements in clustering accuracy and spatial homogeneity.
{"title":"Targeted Incorporating Spatial Information in Sparse Subspace Clustering of Hyperspectral Remote Sensing Images","authors":"Jiaqiyu Zhan, Yuesheng Zhu, Zhiqiang Bai","doi":"10.1109/ICIP40778.2020.9191336","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191336","url":null,"abstract":"Methods based on sparse subspace clustering (SSC) have shown great potential for hyperspectral image (HSI) clustering. However their performance is limited due to the complex spatial-spectral structure in HSIs. In this paper, a spatial best-fit direction (SBFD) algorithm is proposed to update the coefficients obtained from sparse representation to more discriminant features by integrating the spatial-contextual information given by the best-fit pixel of each target pixel. Also, SBFD is more targeted by searching for the best-fit direction than directly using the local window to do max pooling. The proposed SBFD was tested on two widely used hyperspectral dataset, the experimental results indicate its improvement in the clustering accuracy and spatial homogeneity.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124655422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191323
B. Vishwanath, Shunyao Li, K. Rose
Transform coding is a key component of video coders, tasked with spatial decorrelation of the prediction residual. There is growing interest in adapting the transform to local statistics of the inter-prediction residual, going beyond a few standard trigonometric transforms. However, the joint design of multiple transform modes is highly challenging due to critical stability problems inherent to feedback through the codec’s prediction loop, wherein training updates inadvertently impact the signal statistics the transform ultimately operates on, and are often counter-productive (and sometimes catastrophic). It is the premise of this work that a truly effective switched transform design procedure must account for and circumvent this shortcoming. We introduce a data-driven approach to design optimal transform modes for adaptive switching by the encoder. Most importantly, to overcome the critical stability issues, the approach is derived within an asymptotic closed loop (ACL) design framework, wherein each iteration operates in an effective open loop, and is thus inherently stable, but with a subterfuge that ensures that, asymptotically, the design approaches closed loop operation, as required for the ultimate coder operation. Experimental results demonstrate the efficacy of the proposed optimization paradigm which yields significant performance gains over the state-of-the-art.
{"title":"Asymptotic Closed-Loop Design Of Transform Modes For The Inter-Prediction Residual In Video Coding","authors":"B. Vishwanath, Shunyao Li, K. Rose","doi":"10.1109/ICIP40778.2020.9191323","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191323","url":null,"abstract":"Transform coding is a key component of video coders, tasked with spatial decorrelation of the prediction residual. There is growing interest in adapting the transform to local statistics of the inter-prediction residual, going beyond a few standard trigonometric transforms. However, the joint design of multiple transform modes is highly challenging due to critical stability problems inherent to feedback through the codec’s prediction loop, wherein training updates inadvertently impact the signal statistics the transform ultimately operates on, and are often counter-productive (and sometimes catastrophic). It is the premise of this work that a truly effective switched transform design procedure must account for and circumvent this shortcoming. We introduce a data-driven approach to design optimal transform modes for adaptive switching by the encoder. Most importantly, to overcome the critical stability issues, the approach is derived within an asymptotic closed loop (ACL) design framework, wherein each iteration operates in an effective open loop, and is thus inherently stable, but with a subterfuge that ensures that, asymptotically, the design approaches closed loop operation, as required for the ultimate coder operation. Experimental results demonstrate the efficacy of the proposed optimization paradigm which yields significant performance gains over the state-of-the-art.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129505134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191235
M. Sultana, Arif Mahmood, T. Bouwmans, Soon Ki Jung
Dynamic Background Subtraction (BS) is a fundamental problem in many vision-based applications. BS in real, complex environments faces several challenging conditions such as illumination variations, shadows, camera jitter, and bad weather. In this study, we address the challenges of BS in complex scenes by exploiting conditional least squares adversarial networks. During training, a scene-specific conditional least squares adversarial network with two additional regularizations, an L1 loss and a perceptual loss, is employed to learn the dynamic background variations. The input to the model consists of video frames conditioned on the corresponding ground truth, from which the dynamic changes in complex scenes are learned. Testing is then performed on unseen video frames, on which the generator performs dynamic background subtraction. The proposed method, with its three loss terms (least squares adversarial loss, L1 loss, and perceptual loss), is evaluated on two benchmark datasets, CDnet2014 and BMC. The results show improved performance on both datasets compared with 10 existing state-of-the-art methods.
{"title":"Dynamic Background Subtraction Using Least Square Adversarial Learning","authors":"M. Sultana, Arif Mahmood, T. Bouwmans, Soon Ki Jung","doi":"10.1109/ICIP40778.2020.9191235","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191235","url":null,"abstract":"Dynamic Background Subtraction (BS) is a fundamental problem in many vision-based applications. BS in real complex environments has several challenging conditions like illumination variations, shadows, camera jitters, and bad weather. In this study, we aim to address the challenges of BS in complex scenes by exploiting conditional least squares adversarial networks. During training, a scene-specific conditional least squares adversarial network with two additional regularizations including L1-Loss and Perceptual-Loss is employed to learn the dynamic background variations. The given input to the model is video frames conditioned on corresponding ground truth to learn the dynamic changes in complex scenes. Afterwards, testing is performed on unseen test video frames so that the generator would conduct dynamic background subtraction. The proposed method consisting of three loss-terms including least squares adversarial loss, L1-Loss and Perceptual-Loss is evaluated on two benchmark datasets CDnet2014 and BMC. The results of our proposed method show improved performance on both datasets compared with 10 existing state-of-the-art methods.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129896440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190971
Milan Stepanov, G. Valenzise, F. Dufaux
Light fields have higher storage requirements than conventional image and video signals, and therefore demand an efficient representation. To improve coding efficiency, in this work we propose a hybrid coding scheme that combines a learning-based compression approach with a traditional video coding scheme. Their integration offers large gains at low and mid bitrates thanks to the efficient representation of the learning-based approach, and remains competitive with standard tools at high bitrates thanks to the encoding of the residual signal. The proposed approach achieves on average 38% and 31% BD-rate savings compared to HEVC and the JPEG Pleno transform-based codec, respectively.
{"title":"Hybrid Learning-Based And Hevc-Based Coding Of Light Fields","authors":"Milan Stepanov, G. Valenzise, F. Dufaux","doi":"10.1109/ICIP40778.2020.9190971","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190971","url":null,"abstract":"Light fields have additional storage requirements compared to conventional image and video signals, and demand therefore an efficient representation. In order to improve coding efficiency, in this work we propose a hybrid coding scheme which combines a learning-based compression approach with a traditional video coding scheme. Their integration offers great gains at low/mid bitrates thanks to the efficient representation of the learning-based approach and is competitive at high bitrates compared to standard tools thanks to the encoding of the residual signal. The proposed approach achieves on average 38% and 31% BD rate saving compared to HEVC and JPEG Pleno transform-based codec, respectively.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128409172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191356
O. Oyedotun, Abd El Rahman Shabayek, Djamila Aouada, B. Ottersten
We propose the training of very deep neural networks (DNNs) without shortcut connections, known as PlainNets. Training such networks is a notoriously hard problem due to (1) the well-known challenge of vanishing and exploding activations, and (2) the less studied 'near singularity' problem. We argue that if these problems are tackled together, the training of deeper PlainNets becomes easier. Accordingly, we propose training very deep PlainNets by leveraging Leaky Rectified Linear Units (LReLUs), parameter constraints, and strategic parameter initialization. Our approach is simple and successfully trains very deep PlainNets of up to 100 layers without employing shortcut connections. We validate the approach on five challenging datasets: MNIST, CIFAR-10, CIFAR-100, SVHN, and ImageNet. On ImageNet we report the best known results for a PlainNet, with top-1 and top-5 error rates of 24.1% and 7.3%, respectively.
{"title":"Going Deeper With Neural Networks Without Skip Connections","authors":"O. Oyedotun, Abd El Rahman Shabayek, Djamila Aouada, B. Ottersten","doi":"10.1109/ICIP40778.2020.9191356","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191356","url":null,"abstract":"We propose the training of very deep neural networks (DNNs) without shortcut connections known as PlainNets. Training such networks is a notoriously hard problem due to: (1) the relatively popular challenge of vanishing and exploding activations, and (2) the less studied ‘near singularity’ problem. We argue that if the aforementioned problems are tackled together, the training of deeper PlainNets becomes easier. Subsequently, we propose the training of very deep PlainNets by leveraging Leaky Rectified Linear Units (LReLUs), parameter constraint and strategic parameter initialization. Our approach is simple and allows to successfully train very deep PlainNets having up to 100 layers without employing shortcut connections. To validate this approach, we validate on five challenging datasets; namely, MNIST, CIFAR-10, CIFAR100, SVHN and ImageNet datasets. We report the best results known on the ImageNet dataset using a PlainNet with top-1 and top-5 error rates of 24.1% and 7.3%, respectively.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128452217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}