Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412573
Title: Which are the factors affecting the performance of audio surveillance systems?
Antonio Greco, Antonio Roberto, Alessia Saggese, M. Vento
Sound event recognition systems are rapidly becoming part of our lives, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In recent years, a considerable part of the scientific community has started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, due to their successful adoption in almost all computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis aims to understand how these factors affect sound event recognition performance, with a particular focus on the false positive rate, which is highly relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, their capability of distinguishing the events of interest from the background is often still insufficient, which prevents their usage in real applications. Our comprehensive experimental analysis investigates this aspect and allows us to identify useful design guidelines for increasing the specificity of sound event recognition systems.
{"title":"Which are the factors affecting the performance of audio surveillance systems?","authors":"Antonio Greco, Antonio Roberto, Alessia Saggese, M. Vento","doi":"10.1109/ICPR48806.2021.9412573","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412573","url":null,"abstract":"Sound event recognition systems are rapidly becoming part of our life, since they can be profitably used in several vertical markets, ranging from audio security applications to scene classification and multi-modal analysis in social robotics. In the last years, a not negligible part of the scientific community started to apply Convolutional Neural Networks (CNNs) to image-based representations of the audio stream, due to their successful adoption in almost all the computer vision tasks. In this paper, we carry out a detailed benchmark of various widely used CNN architectures and visual representations on a popular dataset, namely the MIVIA Audio Events database. Our analysis is aimed at understanding how these factors affect the sound event recognition performance with a particular focus on the false positive rate, very relevant in audio surveillance solutions. In fact, although most of the proposed solutions achieve a high recognition rate, the capability of distinguishing the events-of-interest from the background is often not yet sufficient for real systems, and prevent its usage in real applications. Our comprehensive experimental analysis investigates this aspect and allows to identify useful design guidelines for increasing the specificity of sound event recognition systems.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"32 1","pages":"7876-7883"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83696760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412395
Title: Exploiting Knowledge Embedded Soft Labels for Image Recognition
Lixian Yuan, Riquan Chen, Hefeng Wu, Tianshui Chen, Wentao Wang, Pei Chen
Objects from correlated classes usually share a highly similar appearance, while objects from uncorrelated classes are very different. Most current image recognition works treat each class independently, which ignores these class correlations and inevitably leads to sub-optimal performance in many cases. Fortunately, object classes inherently form a hierarchy with different levels of abstraction, and this hierarchy encodes rich correlations among different classes. In this work, we utilize a soft label vector that encodes the prior knowledge of class correlations as extra regularization to train image classifiers. Specifically, for each class, instead of simply using a one-hot vector, we assign a high value to its correlated classes and small values to the uncorrelated ones, thus generating knowledge-embedded soft labels. We conduct experiments on both general and fine-grained image recognition benchmarks and demonstrate the superiority of our method compared with existing ones.
{"title":"Exploiting Knowledge Embedded Soft Labels for Image Recognition","authors":"Lixian Yuan, Riquan Chen, Hefeng Wu, Tianshui Chen, Wentao Wang, Pei Chen","doi":"10.1109/ICPR48806.2021.9412395","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412395","url":null,"abstract":"Objects from correlated classes usually share highly similar appearance while objects from uncorrelated classes are very different. Most of current image recognition works treat each class independently, which ignores these class correlations and inevitably leads to sub-optimal performance in many cases. Fortunately, object classes inherently form a hierarchy with different levels of abstraction and this hierarchy encodes rich correlations among different classes. In this work, we utilize a soft label vector that encodes the prior knowledge of class correlations as extra regularization to train the image classifiers. Specifically, for each class, instead of simply using a one-hot vector, we assign a high value to its correlated classes and assign small values to those uncorrelated ones, thus generating knowledge embedded soft labels. We conduct experiments on both general and fine-grained image recognition benchmarks and demonstrate its superiority compared with existing methods.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"61 1","pages":"4989-4995"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90668890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9413196
Title: Learning Defects in Old Movies from Manually Assisted Restoration
A. Renaudeau, Travis Seng, A. Carlier, F. Pierre, F. Lauze, Jean-François Aujol, Jean-Denis Durou
We propose to detect defects in old movies as the first step of a larger framework for old movie restoration by inpainting techniques. The specificity of our work is to learn a film restorer's expertise from a pair of sequences, composed of a movie with defects and the same movie semi-automatically restored with the help of specialized software. In order to detect those defects with minimal human interaction and further reduce the time spent on a restoration, we feed a U-Net with consecutive defective frames as input to detect unexpected variations of pixel intensity over space and time. Since the output of the network is a mask of defect locations, we first have to create the dataset of mask frames on the basis of the frames restored with the software used by the film restorer, instead of a classical synthetic ground truth, which is not available. These masks are estimated by computing the absolute difference between restored frames and defective frames, combined with thresholding and morphological closing. Our network succeeds in automatically detecting real defects with more precision than the manual selection with an all-encompassing shape, including some that the expert restorer could have missed for lack of time.
{"title":"Learning Defects in Old Movies from Manually Assisted Restoration","authors":"A. Renaudeau, Travis Seng, A. Carlier, F. Pierre, F. Lauze, Jean-François Aujol, Jean-Denis Durou","doi":"10.1109/ICPR48806.2021.9413196","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9413196","url":null,"abstract":"We propose to detect defects in old movies, as the first step of a larger framework of old movies restoration by inpainting techniques. The specificity of our work is to learn a film restorer's expertise from a pair of sequences, composed of a movie with defects, and the same movie which was semiautomatically restored with the help of a specialized software. In order to detect those defects with minimal human interaction and further reduce the time spent for a restoration, we feed a U-Net with consecutive defective frames as input to detect the unexpected variations of pixel intensity over space and time. Since the output of the network is a mask of defect location, we first have to create the dataset of mask frames on the basis of restored frames from the software used by the film restorer, instead of classical synthetic ground truth, which is not available. These masks are estimated by computing the absolute difference between restored frames and defectuous frames, combined with thresholding and morphological closing. Our network succeeds in automatically detecting real defects with more precision than the manual selection with an all-encompassing shape, including some the expert restorer could have missed for lack of time.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"17 1","pages":"5254-5261"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90752169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412978
Title: PointSpherical: Deep Shape Context for Point Cloud Learning in Spherical Coordinates
Hua Lin, Bin Fan, Yongcheng Liu, Yirong Yang, Zheng Pan, Jianbo Shi, Chunhong Pan, Huiwen Xie
We propose Spherical Hierarchical modeling of 3D point clouds. Inspired by Shape Context, we design a receptive field on each 3D point by placing a spherical coordinate system on it. We sample points using the furthest point method and create overlapping balls of points. We divide the space into radial, polar angular, and azimuthal angular bins, on which we form a Spherical Hierarchy for each ball. We apply a 1x1 convolution on points to start the initial feature extraction. Repeated 3D CNN and max-pooling operations over the spherical bins propagate contextual information until all the information is condensed in the center bin. Extensive experiments on five datasets provide strong evidence that our method outperforms current models on various point cloud learning tasks, including 2D/3D shape classification, 3D part segmentation, and 3D semantic segmentation.
{"title":"PointSpherical: Deep Shape Context for Point Cloud Learning in Spherical Coordinates","authors":"Hua Lin, Bin Fan, Yongcheng Liu, Yirong Yang, Zheng Pan, Jianbo Shi, Chunhong Pan, Huiwen Xie","doi":"10.1109/ICPR48806.2021.9412978","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412978","url":null,"abstract":"We propose Spherical Hierarchical modeling of 3D point cloud. Inspired by Shape Context, we design a receptive field on each 3D point by placing a spherical coordinate on it. We sample points using the furthest point method and creating overlapping balls of points. We divide the space into radial, polar angular, and azimuthal angular bins on which we form a Spherical Hierarchy for each ball. We apply 1x1 CNN convolution on points to start the initial feature extraction. Repeated 3D CNN and max-pooling over the Spherical bins propagate contextual information until all the information is condensed in the center bin. Extensive experiments on five datasets strongly evidence that our method outperforms current models on various Point Cloud Learning tasks, including 2D/3D shape classification, 3D part segmentation, and 3D semantic segmentation.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"1 1","pages":"10266-10273"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91036586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412818
Title: OCT Image Segmentation Using Neural Architecture Search and SRGAN
O. Dehzangi, Saba Heidari Gheshlaghi, Annahita Amireskandari, N. Nasrabadi, A. Rezai
Medical image segmentation is a critical field in the domain of computer vision, and with the growing acclaim of deep learning based models, research in this field is constantly expanding. Optical coherence tomography (OCT) is a non-invasive method that scans the human retina in depth. It has been hypothesized that the thickness of the retinal layers extracted from OCT scans could be an efficient and effective biomarker for early diagnosis of Alzheimer's disease (AD). In this work, we aim to design a self-training model architecture for the task of segmenting the retinal layers in OCT scans. Neural architecture search (NAS) is a subfield of the AutoML domain that has a significant impact on improving the accuracy of machine vision tasks. We integrate the NAS algorithm with a Unet auto-encoder architecture as its backbone. Then, we employ our proposed model to segment the retinal nerve fiber layer in our preprocessed OCT images with the aim of AD diagnosis. We also trained a super-resolution generative adversarial network on the raw OCT scans to improve the quality of the images before the modeling stage. In our architecture search strategy, different primitive operations are suggested to find the down- and up-sampling Unet cell blocks, and the binary gate method is applied to make the search strategy more practical. Our architecture search method is empirically evaluated by training the Unet and NAS-Unet from scratch. Specifically, the proposed NAS-Unet training significantly outperforms the baseline human-designed architecture, achieving 95.1% in the mean Intersection over Union metric and 79.1% in the Dice similarity coefficient.
{"title":"OCT Image Segmentation Using Neural Architecture Search and SRGAN","authors":"O. Dehzangi, Saba Heidari Gheshlaghi, Annahita Amireskandari, N. Nasrabadi, A. Rezai","doi":"10.1109/ICPR48806.2021.9412818","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412818","url":null,"abstract":"Medical image segmentation is a critical field in the domain of computer vision and with the growing acclaim of deep learning based models, research in this field is constantly expanding. Optical coherence tomography (OCT) is a non-invasive method that scans the human's retina with depth. It has been hypothesized that the thickness of the retinal layers extracted from OCTs could be an efficient and effective biomarker for early diagnosis of AD. In this work, we aim to design a self-training model architecture for the task of segmenting the retinal layers in OCT scans. Neural architecture search (NAS) is a subfield of AutoML domain, which has a significant impact on improving the accuracy of machine vision tasks. We integrate the NAS algorithm with a Unet auto-encoder architecture as its backbone. Then, we employ our proposed model to segment the retinal nerve fiber layer in our preprocessed OCT images with the aim of AD diagnosis. In this work, we trained a super-resolution generative adversarial network on the raw OCT scans to improve the quality of the images before the modeling stage. In our architecture search strategy, different primitive operations suggested to find down- & up-sampling Unet cell blocks and the binary gate method has been applied to make the search strategy more practical. Our architecture search method is empirically evaluated by training on the Unet and NAS-Unet from scratch. Specifically, the proposed NAS-Unet training significantly outperforms the baseline human-designed architecture by achieving 95.1% in the mean Intersection over Union metric and 79.1% in the Dice similarity coefficient.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"7 1","pages":"6425-6430"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91047553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412110
Title: The effect of image enhancement algorithms on convolutional neural networks
J. A. Rodríguez-Rodríguez, Miguel A. Molina-Cabello, Rafaela Benítez-Rochel, Ezequiel López-Rubio
Convolutional Neural Networks (CNNs) are widely used due to their high performance in many tasks related to computer vision. In particular, image classification is one of the fields where CNNs are employed with success. However, images can be heavily affected by several inconveniences, such as noise or poor illumination. Therefore, image enhancement algorithms have been developed to improve the quality of images. In this work, the impact that brightness and image contrast enhancement techniques have on the performance achieved by CNNs in classification tasks is analyzed. More specifically, several well-known CNN architectures, such as AlexNet or GoogLeNet, and image contrast enhancement techniques, such as Gamma Correction or Logarithm Transformation, are studied. Different experiments have been carried out, and the obtained qualitative and quantitative results are reported.
{"title":"The effect of image enhancement algorithms on convolutional neural networks","authors":"J. A. Rodríguez-Rodríguez, Miguel A. Molina-Cabello, Rafaela Benítez-Rochel, Ezequiel López-Rubio","doi":"10.1109/ICPR48806.2021.9412110","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412110","url":null,"abstract":"Convolutional Neural Networks (CNNs) are widely used due to their high performance in many tasks related to computer vision. In particular, image classification is one of the fields where CNNs are employed with success. However, images can be heavily affected by several inconveniences such as noise or illumination. Therefore, image enhancement algorithms have been developed to improve the quality of the images. In this work, the impact that brightness and image contrast enhancement techniques have on the performance achieved by CNNs in classification tasks is analyzed. More specifically, several well known CNNs architectures such as Alexnet or Googlenet, and image contrast enhancement techniques such as Gamma Correction or Logarithm Transformation are studied. Different experiments have been carried out, and the obtained qualitative and quantitative results are reported.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"12 1","pages":"3084-3089"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91198575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9411977
Title: Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case
Melanie Piot, Berangere Bourdoulous, Jordan Gonzalez, Aurelia Deshayes, L. Prevost
In this paper, we propose a dual-memory model inspired by psychological theory. Short-term memory processes the data stream before integrating it into long-term memory, which generalizes. The use case is learning the ability to recognize handwriting. This begins with the learning of prototypical letters. It continues throughout life and gives the individual the ability to recognize increasingly varied handwriting. This second task is achieved by incrementally training our dual-memory model. We used a convolutional network for encoding and random forests as the memory model. Indeed, the latter have the advantage of being easily enhanced to integrate new data and new classes. Performance on the MNIST database is very encouraging, exceeding 95%, while the complexity of the model remains reasonable.
{"title":"Dual-Memory Model for Incremental Learning: The Handwriting Recognition Use Case","authors":"Melanie Piot, Berangere Bourdoulous, Jordan Gonzalez, Aurelia Deshayes, L. Prevost","doi":"10.1109/ICPR48806.2021.9411977","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9411977","url":null,"abstract":"In this paper, we propose a dual memory model inspired by psychological theory. Short-term memory processes the data stream before integrating them into long-term memory, which generalizes. The use case is learning the ability to recognize handwriting. This begins with the learning of prototypical letters. It continues throughout life and gives the individual the ability to recognize increasingly varied handwriting. This second task is achieved by incrementally training our dual-memory model. We used a convolution network for encoding and random forests as the memory model. Indeed, the latter have the advantage of being easily enhanced to integrate new data and new classes. Performances on the MNIST database are very encouraging since they exceed 95% and the complexity of the model remains reasonable.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"3 1","pages":"5527-5534"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91369435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9412233
Title: Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity
Yasunori Ohishi, Yuki Tanaka, K. Kashino
Audio-visual co-segmentation is the task of extracting segments and regions corresponding to specific events from unlabeled audio and video signals. It is particularly important to accomplish it in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly employ a guided attention scheme for this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using real TV broadcasts of sumo wrestling, a sporting event, with live commentary show that our model can automatically extract specific athlete movements and their spoken descriptions in an unsupervised manner.
{"title":"Unsupervised Co-Segmentation for Athlete Movements and Live Commentaries Using Crossmodal Temporal Proximity","authors":"Yasunori Ohishi, Yuki Tanaka, K. Kashino","doi":"10.1109/ICPR48806.2021.9412233","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9412233","url":null,"abstract":"Audio-visual co-segmentation is a task to extract segments and regions corresponding to specific events on unlabeled audio and video signals. It is particularly important to accomplish it in an unsupervised way, since it is generally very difficult to manually label all the objects and events appearing in audio-visual signals for supervised learning. Here, we propose to take advantage of the temporal proximity of corresponding audio and video entities included in the signals. For this purpose, we newly employ a guided attention scheme to this task to efficiently detect and utilize temporal co-occurrences of audio and video information. Experiments using a real TV broadcasts of sumo wrestling, a sport event, with live commentaries show that our model can automatically extract specific athlete movements and its spoken descriptions in an unsupervised manner.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"26 1","pages":"9137-9142"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89744160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9413174
Title: Multi-Direction Convolution for Semantic Segmentation
Dehui Li, Z. Cao, Ke Xian, Xinyuan Qi, Chao Zhang, Hao Lu
Context is known to be one of the crucial factors affecting the performance of semantic segmentation. However, state-of-the-art segmentation models built upon fully convolutional networks are inherently weak in encoding contextual information because of stacked local operations such as convolution and pooling. Failing to capture context leads to inferior segmentation performance. Although many context modules have been proposed to relieve this problem, they still operate in a local manner or use the same contextual information in different positions (due to upsampling). In this paper, we introduce the idea of Multi-Direction Convolution (MDC), a novel operator capable of encoding rich contextual information. This operator is inspired by the observation that the standard convolution only slides along the spatial dimensions ($x, y$ directions) while the channel dimension ($z$ direction) is fixed, which renders slow growth of the receptive field (RF). If the channel-fixed convolution is considered one-direction, MDC is multi-direction in the sense that it slides along both spatial and channel dimensions, i.e., it slides along $x, y$ when $z$ is fixed, along $x, z$ when $y$ is fixed, and along $y, z$ when $x$ is fixed. In this way, MDC is able to encode rich contextual information with a fast increase of the RF. Compared to existing context modules, the encoded context is position-sensitive because no upsampling is required. MDC is also efficient and easy to implement: it can be realized with a few standard convolution layers and permutations. We show through extensive experiments that MDC effectively and selectively enlarges the RF and outperforms existing context modules on two standard benchmarks, Cityscapes and PASCAL VOC2012.
{"title":"Multi - Direction Convolution for Semantic Segmentation","authors":"Dehui Li, Z. Cao, Ke Xian, Xinyuan Qi, Chao Zhang, Hao Lu","doi":"10.1109/ICPR48806.2021.9413174","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9413174","url":null,"abstract":"Context is known to be one of crucial factors effecting the performance improvement of semantic segmentation. However, state-of-the-art segmentation models built upon fully convolutional networks are inherently weak in encoding contextual information because of stacked local operations such as convolution and pooling. Failing to capture context leads to inferior segmentation performance. Despite many context modules have been proposed to relieve this problem, they still operate in a local manner or use the same contextual information in different positions (due to upsampling). In this paper, we introduce the idea of Multi-Direction Convolution (MDC)-a novel operator capable of encoding rich contextual information. This operator is inspired by an observation that the standard convolution only slides along the spatial dimension $(x,y text{direction})$ where the channel dimension $(z quad text{direction})$ is fixed, which renders slow growth of the receptive field (RF). If considering the channel-fixed convolution to be one-direction, MDC is multi-direction in the sense that MDC slides along both spatial and channel dimensions, i.e., it slides along $x,y$ when $z$ is fixed, along $x,z$ when $y$ is fixed, and along $y, z$ when $x$ is fixed. In this way, MDC is able to encode rich contextual information with the fast increase of the RF. Compared to existing context modules, the encoded context is position-sensitive because no upsampling is required. MDC is also efficient and easy to implement. It can be implemented with few standard convolution layers with permutation. We show through extensive experiments that MDC effectively and selectively enlarges the RF and outperforms existing contextual modules on two standard benchmarks, including Cityscapes and PASCAL VOC2012.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"97 1","pages":"519-525"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90387345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-01-10 | DOI: 10.1109/ICPR48806.2021.9413210
Title: PowerHC: non linear normalization of distances for advanced nearest neighbor classification
M. Bicego, M. Orozco-Alzate
In this paper we investigate the exploitation of non-linear scaling of distances for advanced nearest neighbor classification. Starting from the recently found relation between the Hypersphere Classifier (HC) [1] and the Adaptive Nearest Neighbor rule (ANN) [2], we propose PowerHC, an improved version of HC in which distances are normalized using a non-linear mapping; non-linear scaling of data, whose usefulness for feature spaces has already been assessed, has hardly been investigated for distances. A thorough experimental evaluation, involving 24 datasets and a challenging real-world scenario of seismic signal classification, confirms the suitability of the proposed approach.
{"title":"PowerHC: non linear normalization of distances for advanced nearest neighbor classification","authors":"M. Bicego, M. Orozco-Alzate","doi":"10.1109/ICPR48806.2021.9413210","DOIUrl":"https://doi.org/10.1109/ICPR48806.2021.9413210","url":null,"abstract":"In this paper we investigate the exploitation of non linear scaling of distances for advanced nearest neighbor classification. Starting from the recently found relation between the Hypersphere Classifier (HC) [1] and the Adaptive Nearest Neighbor rule (ANN) [2], here we propose PowerHC, an improved version of HC in which distances are normalized using a non linear mapping; non linear scaling of data, whose usefulness for feature spaces has been already assessed, has been hardly investigated for distances. A thorough experimental evaluation, involving 24 datasets and a challenging real world scenario of seismic signal classification, confirms the suitability of the proposed approach.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"12 1","pages":"1205-1211"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89415035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}