Hrnet: Hamiltonian Rescaling Network for Image Downscaling
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190729
Y. Chen, Xi Xiao, Tao Dai, Shutao Xia
Image downscaling is a classical problem in image processing and has recently been connected to image super-resolution (SR), which restores high-quality images from low-resolution ones generated by predetermined downscaling kernels (e.g., bicubic). However, most existing image downscaling methods are deterministic and lose information during the downscaling process, and few downscaling methods are designed specifically for image SR. In this paper, we propose a novel learning-based image downscaling method, the Hamiltonian Rescaling Network (HRNet). The design of HRNet is based on the discretization of a Hamiltonian system, a pair of iterative update equations, which formulates a mechanism for iteratively correcting the error caused by the information lost during image or feature downscaling. Extensive experiments demonstrate the effectiveness of the proposed method in terms of both quantitative and qualitative results.
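The paired update equations suggest a leapfrog-style discretization, in which an auxiliary state repeatedly corrects the main state. The sketch below only illustrates that generic mechanism under this assumption, with small convolutional layers standing in for the gradient terms; it is not the published HRNet architecture.

```python
# Minimal sketch (assumption): leapfrog-style paired updates, where an auxiliary
# state p iteratively corrects the downscaled feature q. Not the published HRNet.
import torch
import torch.nn as nn

class PairedUpdateBlock(nn.Module):
    def __init__(self, channels, steps=3, step_size=0.1):
        super().__init__()
        self.steps, self.h = steps, step_size
        # Hypothetical surrogates for dH/dq and dH/dp, modeled as small conv layers.
        self.grad_q = nn.Conv2d(channels, channels, 3, padding=1)
        self.grad_p = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, q):
        p = torch.zeros_like(q)          # auxiliary "momentum" state
        for _ in range(self.steps):      # pair of iterative update equations
            p = p - self.h * self.grad_q(q)
            q = q + self.h * self.grad_p(p)
        return q

x = torch.randn(1, 16, 32, 32)
print(PairedUpdateBlock(16)(x).shape)    # torch.Size([1, 16, 32, 32])
```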
{"title":"Hrnet: Hamiltonian Rescaling Network for Image Downscaling","authors":"Y. Chen, Xi Xiao, Tao Dai, Shutao Xia","doi":"10.1109/ICIP40778.2020.9190729","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190729","url":null,"abstract":"Image downscaling has become a classical problem in image processing and has recently connected to image super-resolution (SR), which restores high-quality images from low-resolution ones generated by predetermined downscaling kernels (e.g., bicubic). However, most existing image downscaling methods are deterministic and lose information during the downscaling process, while rarely designing specific downscaling methods for image SR. In this paper, we propose a novel learning-based image downscaling method, Hamiltonian Rescaling Network (HRNet). The design of HRNet is based on the discretization of Hamiltonian System, a pair of iterative updating equations, which formulate a mechanism of iterative correction of the error caused by information missing during image or feature downscaling. Extensive experiments demonstrate the effectiveness of our proposed method in terms of both quantitative and qualitative results.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132578906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DCM: A Dense-Attention Context Module For Semantic Segmentation
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190675
Shenghua Li, Quan Zhou, Jia Liu, Jie Wang, Yawen Fan, Xiaofu Wu, Longin Jan Latecki
For image semantic segmentation, a fully convolutional network is usually employed as the encoder to abstract visual features of the input image, and a meticulously designed decoder is used to decode the final feature map of the backbone. The output resolution of backbones designed for the image classification task is too low for the segmentation task, and most existing methods for obtaining the final high-resolution feature map cannot fully utilize the information from the different layers of the backbone. To adequately extract the information of a single layer, the multi-scale context information of different layers, and the global information of the backbone, we present a new attention-augmented module named the Dense-attention Context Module (DCM), which connects common backbones to the decoding heads. Experiments show the promising results of our method on the Cityscapes dataset.
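As an illustration of attention-weighted fusion of multi-scale backbone features, the following sketch projects several backbone stages to a common width, upsamples them to one resolution, and weights them with a learned per-scale attention map. The module name and layer sizes are hypothetical; this is not the published DCM.

```python
# Minimal sketch (assumption): attention-weighted fusion of multi-scale backbone
# features; names and layer sizes are illustrative, not the published DCM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionFusion(nn.Module):
    def __init__(self, in_channels, out_channels=64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.attn = nn.Conv2d(out_channels * len(in_channels), len(in_channels), 1)

    def forward(self, feats):
        target = feats[0].shape[-2:]     # resolution of the shallowest feature map
        feats = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
                 for p, f in zip(self.proj, feats)]
        w = torch.softmax(self.attn(torch.cat(feats, dim=1)), dim=1)  # per-scale weights
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

f1 = torch.randn(1, 256, 64, 64)
f2 = torch.randn(1, 512, 32, 32)
f3 = torch.randn(1, 1024, 16, 16)
out = MultiScaleAttentionFusion([256, 512, 1024])([f1, f2, f3])
print(out.shape)  # torch.Size([1, 64, 64, 64])
```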
{"title":"DCM: A Dense-Attention Context Module For Semantic Segmentation","authors":"Shenghua Li, Quan Zhou, Jia Liu, Jie Wang, Yawen Fan, Xiaofu Wu, Longin Jan Latecki","doi":"10.1109/ICIP40778.2020.9190675","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190675","url":null,"abstract":"For image semantic segmentation, a fully convolutional network is usually employed as the encoder to abstract visual features of the input image. A meticulously designed decoder is used to decoding the final feature map of the backbone. The output resolution of backbones which are designed for image classification task is too low to match segmentation task. Most existing methods for obtaining the final high-resolution feature map can not fully utilize the information of different layers of the backbone. To adequately extract the information of a single layer, the multi-scale context information of different layers, and the global information of backbone, we present a new attention-augmented module named Dense-attention Context Module (DCM), which is used to connect the common backbones and the other decoding heads. The experiments show the promising results of our method on Cityscapes dataset.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133224904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dependent Scalar Quantization For Neural Network Compression
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190955
Paul Haase, H. Schwarz, H. Kirchhoffer, Simon Wiedemann, Talmaj Marinc, Arturo Marbán, K. Müller, W. Samek, D. Marpe, T. Wiegand
Recent approaches to compression of deep neural networks, like the emerging standard on compression of neural networks for multimedia content description and analysis (MPEG-7 part 17), apply scalar quantization and entropy coding of the quantization indexes. In this paper, we present an advanced method for quantization of neural network parameters, which applies dependent scalar quantization (DQ) or trellis-coded quantization (TCQ), and an improved context modeling for the entropy coding of the quantization indexes. We show that the proposed method achieves a 5.778% bitrate reduction with virtually no loss (0.37%) of network performance on average, compared to the baseline methods of the second test model (NCTM) of MPEG-7 part 17 at the relevant working points.
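For intuition, the following simplified sketch quantizes a 1-D parameter vector with two interleaved scalar quantizers and a parity-driven two-state machine, searched with a Viterbi-style pass that trades distortion against a crude rate proxy. The standardized design uses a four-state machine and a real rate model, so this is only an assumption-laden toy version, not the NCTM algorithm.

```python
# Toy sketch (assumption): dependent scalar quantization with two interleaved
# quantizers and a 2-state parity-driven machine, searched Viterbi-style.
import numpy as np

def dq_encode(x, step, lam=0.05):
    def recon(k, state):
        # state 0 reconstructs on even multiples of step, state 1 on odd multiples
        return (2 * k + state) * step

    INF = float("inf")
    cost = {0: 0.0, 1: INF}                       # start in quantizer state 0
    path = {0: [], 1: []}
    for xi in x:
        new_cost = {0: INF, 1: INF}
        new_path = {0: [], 1: []}
        base = int(round(xi / (2 * step)))        # candidate index near xi
        for s in (0, 1):
            if cost[s] == INF:
                continue
            for k in (base - 1, base, base + 1):  # small candidate set
                c = cost[s] + (xi - recon(k, s)) ** 2 + lam * abs(k)  # distortion + rate proxy
                nxt = (s + k) % 2                 # parity-driven state transition
                if c < new_cost[nxt]:
                    new_cost[nxt] = c
                    new_path[nxt] = path[s] + [(k, s)]
        cost, path = new_cost, new_path
    best = min(cost, key=cost.get)
    return path[best]                             # (index, quantizer) pair per parameter

print(dq_encode(np.array([0.9, -1.4, 2.2]), step=0.5))
```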
{"title":"Dependent Scalar Quantization For Neural Network Compression","authors":"Paul Haase, H. Schwarz, H. Kirchhoffer, Simon Wiedemann, Talmaj Marinc, Arturo Marbán, K. Müller, W. Samek, D. Marpe, T. Wiegand","doi":"10.1109/ICIP40778.2020.9190955","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190955","url":null,"abstract":"Recent approaches to compression of deep neural networks, like the emerging standard on compression of neural networks for multimedia content description and analysis (MPEG-7 part 17), apply scalar quantization and entropy coding of the quantization indexes. In this paper we present an advanced method for quantization of neural network parameters, which applies dependent scalar quantization (DQ) or trellis-coded quantization (TCQ), and an improved context modeling for the entropy coding of the quantization indexes. We show that the proposed method achieves 5.778% bitrate reduction and virtually no loss (0.37%) of network performance in average, compared to the baseline methods of the second test model (NCTM) of MPEG-7 part 17 for relevant working points.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127831226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proposal-Based Instance Segmentation With Point Supervision
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190782
I. Laradji, Negar Rostamzadeh, Pedro H. O. Pinheiro, David Vázquez, Mark W. Schmidt
Instance segmentation methods often require costly per-pixel labels. We propose a method called WISE-Net that requires only point-level annotations. During training, the model has access to only a single pixel label per object, yet the task is to output full segmentation masks. To address this challenge, we construct a network with two branches: (1) a localization network (L-Net) that predicts the location of each object; and (2) an embedding network (E-Net) that learns an embedding space in which pixels of the same object are close. The segmentation masks for the located objects are obtained by grouping pixels with similar embeddings. We evaluate our approach on the PASCAL VOC, COCO, KITTI, and Cityscapes datasets. The experiments show that our method (1) obtains results competitive with fully supervised methods in certain scenarios; (2) outperforms fully and weakly supervised methods under a fixed annotation budget; and (3) establishes a first strong baseline for instance segmentation with point-level supervision.
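The grouping step described above can be illustrated with a minimal sketch: given per-pixel embeddings and one annotated point per object, each pixel is assigned to the object whose seed embedding is closest. This stands in for the grouping idea only; the tensor shapes and the nearest-seed rule are assumptions, not the authors' exact L-Net/E-Net pipeline.

```python
# Minimal sketch (assumption): build instance masks by assigning every pixel to
# the nearest seed embedding; not the authors' full pipeline.
import torch

def group_by_embedding(embeddings, seeds):
    # embeddings: (D, H, W) per-pixel embeddings; seeds: list of (y, x) point labels.
    D, H, W = embeddings.shape
    protos = torch.stack([embeddings[:, y, x] for y, x in seeds])      # (K, D)
    flat = embeddings.reshape(D, -1).t()                               # (H*W, D)
    dist = torch.cdist(flat, protos)                                   # (H*W, K)
    assign = dist.argmin(dim=1).reshape(H, W)                          # instance id per pixel
    return [(assign == k) for k in range(len(seeds))]                  # one boolean mask per object

emb = torch.randn(8, 64, 64)
masks = group_by_embedding(emb, seeds=[(10, 12), (40, 50)])
print(masks[0].shape, masks[0].dtype)   # torch.Size([64, 64]) torch.bool
```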
{"title":"Proposal-Based Instance Segmentation With Point Supervision","authors":"I. Laradji, Negar Rostamzadeh, Pedro H. O. Pinheiro, David Vázquez, Mark W. Schmidt","doi":"10.1109/ICIP40778.2020.9190782","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190782","url":null,"abstract":"Instance segmentation methods often require costly per-pixel labels. We propose a method called WISE-Net that only requires point-level annotations. During training, the model only has access to a single pixel label per object, yet the task is to output full segmentation masks. To address this challenge, we construct a network with two branches: (1) a 10-calization network (L-Net) that predicts the location of each object; and (2) an embedding network (E-Net) that learns an embedding space where pixels of the same object are close. The segmentation masks for the located objects are obtained by grouping pixels with similar embeddings. We evaluate our approach on PASCAL VOC, COCO, KITTI and CityScapes datasets. The experiments show that our method (1) obtains competitive results compared to fully-supervised methods in certain scenarios; (2) outperforms fully-and weakly-supervised methods with a fixed annotation budget; and (3) establishes a first strong baseline for instance segmentation with point-level supervision.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"360 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131692543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variational Autoencoder Based Unsupervised Domain Adaptation For Semantic Segmentation
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190973
Zongyao Li, Ren Togo, Takahiro Ogawa, M. Haseyama
Unsupervised domain adaptation, which transfers supervised knowledge from a labeled domain to an unlabeled domain, remains a tough problem in computer vision, especially for semantic segmentation. Methods inspired by adversarial learning and semi-supervised learning have been developed for unsupervised domain adaptation in semantic segmentation and have achieved outstanding performance. In this paper, we propose a novel method for this task. Whereas adversarial learning-based methods use a discriminator to align the feature distributions of different domains, we employ a variational autoencoder to reach the same goal in a non-adversarial manner. Since the two approaches are compatible, we also integrate an adversarial loss into our method. By further introducing pseudo labels, our method achieves state-of-the-art performance on two benchmark adaptation scenarios, GTA5-to-CITYSCAPES and SYNTHIA-to-CITYSCAPES.
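One non-adversarial way to encourage alignment, in the spirit described, is a shared variational autoencoder whose KL term pulls features from both domains toward the same Gaussian prior. The sketch below shows that idea on generic feature vectors; the layer sizes, loss weights, and training step are assumptions rather than the authors' architecture.

```python
# Minimal sketch (assumption): a shared VAE whose KL term pushes source and
# target features toward one Gaussian prior; not the authors' exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureVAE(nn.Module):
    def __init__(self, dim=256, latent=32):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)   # outputs mean and log-variance
        self.dec = nn.Linear(latent, dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return recon, kl

vae = FeatureVAE()
src, tgt = torch.randn(16, 256), torch.randn(16, 256)
loss = 0.0
for x in (src, tgt):                       # same prior for source and target features
    recon, kl = vae(x)
    loss = loss + F.mse_loss(recon, x) + 0.1 * kl
loss.backward()
```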
{"title":"Variational Autoencoder Based Unsupervised Domain Adaptation For Semantic Segmentation","authors":"Zongyao Li, Ren Togo, Takahiro Ogawa, M. Haseyama","doi":"10.1109/ICIP40778.2020.9190973","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190973","url":null,"abstract":"Unsupervised domain adaptation, which transfers supervised knowledge from a labeled domain to an unlabeled domain, remains a tough problem in the field of computer vision, especially for semantic segmentation. Some methods inspired by adversarial learning and semi-supervised learning have been developed for unsupervised domain adaptation in semantic segmentation and achieved outstanding performances. In this paper, we propose a novel method for this task. Like adversarial learning-based methods using a discriminator to align the feature distributions from different domains, we employ a variational autoencoder to get to the same destination but in a non-adversarial manner. Since the two approaches are compatible, we also integrate an adversarial loss into our method. By further introducing pseudo labels, our method can achieve state-of-the-art performances on two benchmark adaptation scenarios, GTA5-toCITYSCAPES and SYNTHIA-to-CITYSCAPES.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124212853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dronecaps: Recognition Of Human Actions In Drone Videos Using Capsule Networks With Binary Volume Comparisons
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190864
Abdullah M. Algamdi, Victor Sanchez, Chang-Tsun Li
Understanding human actions in videos captured by drones is a challenging task in computer vision due to the unfamiliar viewpoints of individuals and the changes in their apparent size caused by the camera's location and motion. This work proposes DroneCaps, a capsule network architecture for multi-label human action recognition (HAR) in videos captured by drones. DroneCaps uses features computed by 3D convolutional neural networks plus a new set of features computed by a novel Binary Volume Comparison layer. All these features, in conjunction with the learning power of CapsNets, allow the network to understand and abstract the different viewpoints and poses of the depicted individuals very efficiently, thus improving multi-label HAR. The evaluation of the DroneCaps architecture's performance for multi-label classification shows that it outperforms state-of-the-art methods on the Okutama-Action dataset.
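The Binary Volume Comparison layer is specific to the paper and is not reproduced here; as background, the sketch below shows only the standard capsule "squash" non-linearity from the original CapsNet formulation (Sabour et al., 2017), which capsule-based models of this kind build on.

```python
# The capsule "squash" non-linearity from the original CapsNet formulation;
# shown as background only, not the paper's Binary Volume Comparison layer.
import torch

def squash(s, dim=-1, eps=1e-8):
    # Scales capsule vectors so short ones shrink toward 0 and long ones toward unit length.
    sq_norm = (s * s).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

caps = torch.randn(2, 32, 8)            # (batch, num_capsules, capsule_dim)
print(squash(caps).norm(dim=-1).max())  # all norms stay below 1
```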
{"title":"Dronecaps: Recognition Of Human Actions In Drone Videos Using Capsule Networks With Binary Volume Comparisons","authors":"Abdullah M. Algamdi, Victor Sanchez, Chang-Tsun Li","doi":"10.1109/ICIP40778.2020.9190864","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190864","url":null,"abstract":"Understanding human actions from videos captured by drones is a challenging task in computer vision due to the unfamiliar viewpoints of individuals and changes in their size due to the camera’s location and motion. This work proposes DroneCaps, a capsule network architecture for multi-label human action recognition (HAR) in videos captured by drones. DroneCaps uses features computed by 3D convolution neural networks plus a new set of features computed by a novel Binary Volume Comparison layer. All these features, in conjunction with the learning power of CapsNets, allow understanding and abstracting the different viewpoints and poses of the depicted individuals very efficiently, thus improving multi-label HAR. The evaluation of the DroneCaps architecture’s performance for multi-label classification shows that it outperforms state-of-the-art methods on the Okutama-Action dataset.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124507115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variational Auto-Encoders Without Graph Coarsening For Fine Mesh Learning
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191189
Nicolas Vercheval, H. Bie, A. Pižurica
In this paper, we propose a Variational Auto-Encoder able to correctly reconstruct a fine mesh from a very low-dimensional latent space. The architecture avoids the usual coarsening of the graph and relies on pooling layers for the decoding phase and on the mean values of the training set for the up-sampling phase. Compared to previous work, we select new operators; in particular, we define a new Dirac operator that can be extended to different types of graph-structured data. We show improvements over the previous operators and compare our results with the current benchmark on the CoMA dataset.
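One plausible reading of relying "on the mean values of the training set for the up-sampling phase" is a decoder that predicts per-vertex offsets added to the training-set mean mesh. The sketch below illustrates only that anchoring idea under this assumption; it omits the paper's pooling layers and Dirac operator.

```python
# Minimal sketch (assumption): decode a latent code into per-vertex offsets
# added to the training-set mean mesh; not the paper's decoder.
import torch
import torch.nn as nn

class MeanAnchoredDecoder(nn.Module):
    def __init__(self, mean_vertices, latent_dim=8):
        super().__init__()
        self.register_buffer("mean_vertices", mean_vertices)       # (V, 3) training-set mean
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, mean_vertices.numel()))

    def forward(self, z):
        offsets = self.mlp(z).view(-1, *self.mean_vertices.shape)  # (B, V, 3)
        return self.mean_vertices + offsets

mean_mesh = torch.zeros(500, 3)                    # placeholder mean shape
dec = MeanAnchoredDecoder(mean_mesh)
print(dec(torch.randn(4, 8)).shape)                # torch.Size([4, 500, 3])
```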
{"title":"Variational Auto-Encoders Without Graph Coarsening For Fine Mesh Learning","authors":"Nicolas Vercheval, H. Bie, A. Pižurica","doi":"10.1109/ICIP40778.2020.9191189","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191189","url":null,"abstract":"In this paper, we propose a Variational Auto-Encoder able to correctly reconstruct a fine mesh from a very low-dimensional latent space. The architecture avoids the usual coarsening of the graph and relies on pooling layers for the decoding phase and on the mean values of the training set for the up-sampling phase. We select new operators compared to previous work, and in particular, we define a new Dirac operator which can be extended to different types of graph structured data. We show the improvements over the previous operators and compare the results with the current benchmark on the Coma Dataset.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116387508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interpretable Synthetic Reduced Nearest Neighbor: An Expectation Maximization Approach
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9190986
Pooya Tavallali, P. Tavallali, M. Khosravi, M. Singhal
Synthetic Reduced Nearest Neighbor (SRNN) is a nearest-neighbor model constrained to have K synthetic samples (prototypes/centroids). There has been little work on direct optimization and interpretability of SRNN with proper guarantees such as convergence. To tackle these issues, this paper, inspired by the K-means algorithm, proposes a novel optimization of Synthetic Reduced Nearest Neighbor based on expectation maximization (EM-SRNN) that always converges while monotonically decreasing the objective function. The optimization iterates between updating the centroids of the model and assigning training samples to centroids. EM-SRNN is interpretable since the centroids represent sub-clusters of the classes, a type of interpretability suitable for fields such as image processing and epidemiological studies. Analytical aspects of the problem are explored, and linear complexity of the optimization over the training set is shown. Finally, EM-SRNN is shown to have superior or similar performance compared with several other interpretable state-of-the-art models such as trees and kernel SVMs.
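The iterate-between-centroids-and-assignments structure can be illustrated with a K-means-like alternation in which each prototype also carries a class label (the majority class of its sub-cluster). This is a hypothetical stand-in for intuition only, not the EM-SRNN objective or its convergence analysis.

```python
# Sketch (assumption): alternate between assigning samples to prototypes and
# refitting prototype positions and class labels; not the exact EM-SRNN updates.
import numpy as np

def fit_prototypes(X, y, k=4, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    protos = X[rng.choice(len(X), k, replace=False)]          # prototype positions
    labels = y[rng.choice(len(X), k, replace=False)]          # prototype class labels
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - protos[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            members = assign == j
            if members.any():
                protos[j] = X[members].mean(axis=0)           # centroid update
                vals, counts = np.unique(y[members], return_counts=True)
                labels[j] = vals[counts.argmax()]             # majority class of the sub-cluster
    return protos, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
y = np.array([0] * 50 + [1] * 50)
protos, labels = fit_prototypes(X, y)
print(protos.shape, labels)       # (4, 2) and one class label per prototype
```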
{"title":"Interpretable Synthetic Reduced Nearest Neighbor: An Expectation Maximization Approach","authors":"Pooya Tavallali, P. Tavallali, M. Khosravi, M. Singhal","doi":"10.1109/ICIP40778.2020.9190986","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9190986","url":null,"abstract":"Synthetic Reduced Nearest Neighbor (SRNN) is a Nearest Neighbor model which is constrained to have K synthetic samples (prototypes/centroids). There has been little attempt toward direct optimization and interpretability of SRNN with proper guarantees like convergence. To tackle these issues, this paper, inspired by K-means algorithm, provides a novel optimization of Synthetic Reduced Nearest Neighbor based on Expectation Maximization (EM-SRNN) that always converges while also monotonically decreases the objective function. The optimization consists of iterating over the centroids of the model and assignment of training samples to centroids. The EM-SRNN is interpretable since the centroids represent sub-clusters of the classes. Such type of interpretability is suitable for various studies such as image processing and epidemiological studies. In this paper, analytical aspects of problem are explored and linear complexity of optimization over the trainset is shown. Finally, EM-SRNN is shown to have superior or similar performance when compared with several other interpretable and similar state-of-the-art models such trees and kernel SVMs.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123401742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatio-Temporal Slowfast Self-Attention Network For Action Recognition
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191290
Myeongjun Kim, Taehun Kim, Daijin Kim
We propose a Spatio-Temporal SlowFast Self-Attention network for action recognition. Conventional convolutional neural networks are good at capturing local patterns in the data. However, to understand a human action, it is necessary to consider both the person and the overall context of the scene. Therefore, we repurpose the self-attention mechanism from the Self-Attention GAN (SAGAN) for our model to retrieve global semantic context during action recognition. Using this self-attention mechanism, we propose a module that extracts four kinds of features from video: spatial information, temporal information, slow-action information, and fast-action information. We train and test our network on the Atomic Visual Actions (AVA) dataset and show significant frame-AP improvements on 28 categories.
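The SAGAN-style self-attention block itself is public and can be sketched directly: 1x1x1 convolutions produce queries, keys, and values, attention is computed over all flattened spatio-temporal positions, and the result is added back through a learned residual weight. The layer sizes are illustrative, and this is only the attention building block, not the full four-feature SlowFast module.

```python
# SAGAN-style self-attention over flattened spatio-temporal positions of a video
# feature map; building block only, not the paper's full module.
import torch
import torch.nn as nn

class SpatioTemporalSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv3d(channels, channels // 8, 1)   # query projection
        self.g = nn.Conv3d(channels, channels // 8, 1)   # key projection
        self.h = nn.Conv3d(channels, channels, 1)        # value projection
        self.gamma = nn.Parameter(torch.zeros(1))        # learned residual weight

    def forward(self, x):                                # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        q = self.f(x).flatten(2).transpose(1, 2)         # (B, THW, C//8)
        k = self.g(x).flatten(2)                         # (B, C//8, THW)
        v = self.h(x).flatten(2)                         # (B, C, THW)
        attn = torch.softmax(q @ k, dim=-1)              # (B, THW, THW)
        out = (v @ attn.transpose(1, 2)).view(B, C, T, H, W)
        return self.gamma * out + x                      # residual, as in SAGAN

x = torch.randn(1, 32, 4, 14, 14)
print(SpatioTemporalSelfAttention(32)(x).shape)          # torch.Size([1, 32, 4, 14, 14])
```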
{"title":"Spatio-Temporal Slowfast Self-Attention Network For Action Recognition","authors":"Myeongjun Kim, Taehun Kim, Daijin Kim","doi":"10.1109/ICIP40778.2020.9191290","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191290","url":null,"abstract":"We propose Spatio-Temporal SlowFast Self-Attention network for action recognition. Conventional Convolutional Neural Networks have the advantage of capturing the local area of the data. However, to understand a human action, it is appropriate to consider both human and the overall context of given scene. Therefore, we repurpose a self-attention mechanism from Self-Attention GAN (SAGAN) to our model for retrieving global semantic context when making action recognition. Using the self-attention mechanism, we propose a module that can extract four features in video information: spatial information, temporal information, slow action information, and fast action information. We train and test our network on the Atomic Visual Actions (AVA) dataset and show significant frame-AP improvements on 28 categories.","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123475174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Learning And Interactivity For Video Rotoscoping
Pub Date: 2020-10-01 | DOI: 10.1109/ICIP40778.2020.9191057
Shivam Saboo, F. Lefèbvre, Vincent Demoulin
In this work, we extend the idea of object co-segmentation [10] to perform interactive video segmentation. Our framework predicts the coordinates of vertices along the boundary of an object for two frames of a video simultaneously. The predicted vertices are interactive in nature, and a user interaction on one frame assists the network in correcting the predictions for both frames. We employ an attention mechanism at the encoder stage and a simple combination network at the decoder stage, which allows the network to perform this simultaneous correction efficiently. The framework is also robust to the distance between the two input frames, handling gaps of up to 50 frames between the two inputs. We train our model on a professional dataset consisting of pixel-accurate annotations provided by professional roto artists. We test our model on DAVIS [15] and achieve state-of-the-art results in both automatic and interactive modes, surpassing Curve-GCN [11] and PolyRNN++ [1].
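A common way to inject a user correction into a network, used here purely as an assumed illustration of the interactivity idea, is to rasterize the clicked point as a Gaussian heatmap and stack it with the image channels; the paper's actual vertex-correction mechanism is not reproduced.

```python
# Generic sketch (assumption): encode a user click as a Gaussian heatmap channel
# concatenated to the frame; not the paper's vertex-correction mechanism.
import torch

def click_heatmap(height, width, click_yx, sigma=5.0):
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    cy, cx = click_yx
    return torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

frame = torch.randn(3, 128, 128)                     # RGB frame
guide = click_heatmap(128, 128, click_yx=(40, 70))   # user click near the boundary
net_input = torch.cat([frame, guide.unsqueeze(0)], dim=0)
print(net_input.shape)                               # torch.Size([4, 128, 128])
```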
{"title":"Deep Learning And Interactivity For Video Rotoscoping","authors":"Shivam Saboo, F. Lefèbvre, Vincent Demoulin","doi":"10.1109/ICIP40778.2020.9191057","DOIUrl":"https://doi.org/10.1109/ICIP40778.2020.9191057","url":null,"abstract":"In this work we extend the idea of object co-segmentation [10] to perform interactive video segmentation. Our framework predicts the coordinates of vertices along the boundary of an object for two frames of a video simultaneously. The predicted vertices are interactive in nature and a user interaction on one frame assists the network to correct the predictions for both frames. We employ attention mechanism at the encoder stage and a simple combination network at the decoder stage which allows the network to perform this simultaneous correction efficiently. The framework is also robust to the distance between the two input frames as it can handle a distance of up to 50 frames in between the two inputs.We train our model on professional dataset, which consists pixel accurate annotations given by professional Roto artists. We test our model on DAVIS [15] and achieve state of the art results in both automatic and interactive mode surpassing Curve-GCN [11] and PolyRNN++ [1].","PeriodicalId":405734,"journal":{"name":"2020 IEEE International Conference on Image Processing (ICIP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123665348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}