Detecting the Presence of Vehicles and Equipment in SAR Imagery Using Image Texture Features
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174598
Michael Harner, A. Groener, M. D. Pritt
In this work, we present a methodology for monitoring man-made, construction-like activities in low-resolution SAR imagery. Our source of data is the European Space Agency’s Sentinel-1 satellite, which provides global coverage at a 12-day revisit rate. Despite limitations in resolution, our methodology enables us to monitor activity levels (i.e., the presence of vehicles and equipment) at a pre-defined location by analyzing the texture of detected SAR imagery. Using an exploratory dataset, we trained a support vector machine (SVM), a random binary forest, and a fully-connected neural network for classification. We use Haralick texture features in the VV and VH polarization channels as the input features to our classifiers. Each classifier showed promising results in distinguishing between two possible construction-site activity levels. This paper documents a case study centered on monitoring the construction process for oil and gas fracking wells.
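A minimal sketch of the texture-feature-plus-classifier idea described above, assuming image chips are already extracted per site and polarization; it uses GLCM statistics (a Haralick-style subset) rather than the paper's exact feature set, and the `chips` variable, thresholds, and labels are illustrative placeholders.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

PROPS = ("contrast", "homogeneity", "energy", "correlation")

def texture_vector(chip_8bit):
    """GLCM texture statistics (a Haralick-style subset) for one polarization chip."""
    glcm = graycomatrix(chip_8bit, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return np.hstack([graycoprops(glcm, p).ravel() for p in PROPS])

def features(vv_chip, vh_chip):
    # Concatenate VV and VH texture statistics into one feature vector.
    return np.hstack([texture_vector(vv_chip), texture_vector(vh_chip)])

# `chips` is an assumed list of (vv_chip, vh_chip, label) tuples of 8-bit patches.
X = np.array([features(vv, vh) for vv, vh, _ in chips])
y = np.array([label for _, _, label in chips])   # e.g. 0 = low activity, 1 = high
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
```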
{"title":"Detecting the Presence of Vehicles and Equipment in SAR Imagery Using Image Texture Features","authors":"Michael Harner, A. Groener, M. D. Pritt","doi":"10.1109/AIPR47015.2019.9174598","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174598","url":null,"abstract":"In this work, we present a methodology for monitoring man-made, construction-like activities in low-resolution SAR imagery. Our source of data is the European Space Agency’s Sentinel-l satellite which provides global coverage at a 12-day revisit rate. Despite limitations in resolution, our methodology enables us to monitor activity levels (i.e. presence of vehicles, equipment) of a pre-defined location by analyzing the texture of detected SAR imagery. Using an exploratory dataset, we trained a support vector machine (SVM), a random binary forest, and a fully-connected neural network for classification. We use Haralick texture features in the VV and VH polarization channels as the input features to our classifiers. Each classifier showed promising results in being able to distinguish between two possible types of construction-site activity levels. This paper documents a case study that is centered around monitoring the construction process for oil and gas fracking wells.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129783443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Evaluation of Semantic Video Compression using Multi-cue Object Detection
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174601
Noor M. Al-Shakarji, F. Bunyak, H. Aliakbarpour, G. Seetharaman, K. Palaniappan
Video compression is a critical task in real-time aerial surveillance scenarios, where limited communication bandwidth and on-board storage greatly restrict air-to-ground and air-to-air communications. In these cases, efficient handling of video data is needed to ensure optimal storage, smoother video transmission, and fast, reliable video analysis. Conventional video compression schemes were typically designed for human visual perception rather than automated video analytics. Information loss and artifacts introduced during image/video compression impose serious limitations on the performance of automated video analytics tasks. These limitations are compounded in aerial imagery by complex backgrounds and the small size of objects. In this paper, we describe and evaluate a salient region estimation pipeline for aerial imagery to enable adaptive bit-rate allocation during video compression. The salient regions are estimated using a multi-cue moving vehicle detection pipeline that synergistically fuses complementary appearance and motion cues using deep learning-based object detection and flux tensor-based spatio-temporal filtering. Adaptive compression results using the described multi-cue saliency estimation pipeline are compared against conventional MPEG and JPEG encoding in terms of compression ratio, image quality, and impact on automated video analytics operations. Experimental results on the ABQ urban aerial video dataset [1] show that incorporating contextual information enables high semantic compression ratios of over 2000:1 while preserving image quality in the regions of interest. The proposed pipeline enables better utilization of the limited bandwidth of air-to-ground and air-to-air network links.
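A minimal sketch of saliency-driven adaptive compression under the assumption that a binary saliency mask is already available (the paper derives it from the multi-cue detector); this toy version uses per-region JPEG quality rather than the paper's MPEG pipeline, and the quality values are illustrative.

```python
import cv2
import numpy as np

def semantic_jpeg(frame_bgr, mask, q_roi=90, q_bg=10):
    """Re-encode salient regions at high quality and the background at low quality."""
    def recode(img, q):
        ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, q])
        return cv2.imdecode(buf, cv2.IMREAD_COLOR)

    hi = recode(frame_bgr, q_roi)      # quality reserved for detected vehicles
    lo = recode(frame_bgr, q_bg)       # aggressive compression elsewhere
    m = (mask > 0)[..., None]          # broadcast the 0/1 mask over color channels
    return np.where(m, hi, lo)
```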
{"title":"Performance Evaluation of Semantic Video Compression using Multi-cue Object Detection","authors":"Noor M. Al-Shakarji, F. Bunyak, H. Aliakbarpour, G. Seetharaman, K. Palaniappan","doi":"10.1109/AIPR47015.2019.9174601","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174601","url":null,"abstract":"Video compression becomes a very important task during real-time aerial surveillance scenarios where limited communication bandwidth and on-board storage greatly restrict air-to-ground and air-to-air communications. In these cases, efficient handling of video data is needed to ensure optimum storage, smoother video transmission, fast and reliable video analysis. Conventional video compression schemes were typically designed for human visual perception rather than automated video analytics. Information loss and artifacts introduced during image/video compression impose serious limitations on the performance of automated video analytics tasks. These limitations are further increased in aerial imagery due to complex background and small size of objects. In this paper, we describe and evaluate a salient region estimation pipeline for aerial imagery to enable adaptive bit-rate allocation during video compression. The salient regions are estimated using a multi-cue moving vehicle detection pipeline, which synergistically fuses complementary appearance and motion cues using deep learning-based object detection and flux tensor-based spatio-temporal filtering approaches. Adaptive compression results using the described multi-cue saliency estimation pipeline are compared against conventional MPEG and JPEG encoding in terms of compression ratio, image quality, and impact on automated video analytics operations. Experimental results on ABQ urban aerial video dataset [1] show that incorporation of contextual information enables high semantic compression ratios of over 2000:1 while preserving image quality for the regions of interest. The proposed pipeline enables better utilization of the limited bandwidth of the air-to-ground or air-to-air network links.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133093404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi Stage Common Vector Space for Multimodal Embeddings
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174583
Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha
Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks, whereas text typically uses recurrent neural networks. Similarly, other modalities of data, such as 3D point clouds, audio signals, and videos, can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can support tasks like multimodal retrieval, search, and generation, where, for example, images can be retrieved from text or audio input. Adding a new modality would generally mean resetting and retraining the entire network. Instead, we introduce a stagewise learning technique in which each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the M-CVS without changing existing encodings, allowing extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMediaNet multimodal datasets. Extensive ablation experiments using image, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.
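A minimal sketch of the cross-modal contrastive objective mentioned above, assuming paired embeddings from two modality encoders; the margin, shapes, and the margin-based loss form are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_pair, margin=0.5):
    """Pull matching cross-modal pairs together, push mismatched pairs apart."""
    d = F.pairwise_distance(emb_a, emb_b)               # per-pair L2 distance
    pos = same_pair * d.pow(2)                          # matching pairs: shrink distance
    neg = (1 - same_pair) * F.relu(margin - d).pow(2)   # mismatches: enforce a margin
    return (pos + neg).mean()

# Stagewise extension (as described above): a new modality's encoder is trained
# against a frozen reference-modality encoder with this loss, so the existing
# encodings in the common space are left unchanged.
```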
{"title":"Multi Stage Common Vector Space for Multimodal Embeddings","authors":"Sabarish Gopalakrishnan, Premkumar Udaiyar, Shagan Sah, R. Ptucha","doi":"10.1109/AIPR47015.2019.9174583","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174583","url":null,"abstract":"Deep learning frameworks have proven to be very effective at tasks like classification, segmentation, detection, and translation. Before being processed by a deep learning model, objects are first encoded into a suitable vector representation. For example, images are typically encoded using convolutional neural networks whereas texts typically use recurrent neural networks. Similarly, other modalities of data like 3D point clouds, audio signals, and videos can be transformed into vectors using appropriate encoders. Although deep learning architectures do a good job of learning these vector representations in isolation, learning a single common representation across multiple modalities is a challenging task. In this work, we develop a Multi Stage Common Vector Space (M-CVS) that is suitable for encoding multiple modalities. The M-CVS is an efficient low-dimensional vector representation in which the contextual similarity of data is preserved across all modalities through the use of contrastive loss functions. Our vector space can perform tasks like multimodal retrieval, searching and generation, where for example, images can be retrieved from text or audio input. The addition of a new modality would generally mean resetting and training the entire network. However, we introduce a stagewise learning technique where each modality is compared to a reference modality before being projected to the M-CVS. Our method ensures that a new modality can be mapped into the MCVS without changing existing encodings, allowing the extension to any number of modalities. We build and evaluate M-CVS on the XMedia and XMedianet multimodal dataset. Extensive ablation experiments using images, text, audio, video, and 3D point cloud modalities demonstrate the complexity vs. accuracy tradeoff under a wide variety of real-world use cases.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133190620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Robust Networks to Inform Lightweight Models in Semi-Supervised Learning for Object Detection
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174592
Jonathan Worobey, S. Recker, C. Gribble
A common trade-off among object detection algorithms is accuracy-for-speed (or vice versa). To meet our application’s real-time requirement, we use a Single Shot MultiBox Detector (SSD) model. This architecture meets our latency requirements; however, a large amount of training data is required to achieve an acceptable accuracy level. While unusable for our end application, more robust network architectures, such as Regions with CNN features (R-CNN), provide an important advantage over SSD models—they can be more reliably trained on small datasets. By fine-tuning R-CNN models on a small number of hand-labeled examples, we create new, larger training datasets by running inference on the remaining unlabeled data. We show that these new, inferenced labels are beneficial to the training of lightweight models. These inferenced datasets are imperfect, and we explore various methods of dealing with the errors, including hand-labeling mislabeled data, discarding poor examples, and simply ignoring errors. Further, we explore the total cost, measured in human and computer time, required to execute this workflow compared to a hand-labeling baseline.
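A minimal sketch of the pseudo-labeling step described above, assuming the R-CNN "teacher" detections arrive as dicts with "box", "score", and "class" keys; the field names and confidence threshold are illustrative assumptions, and discarding low-confidence examples corresponds to one of the error-handling strategies mentioned in the abstract.

```python
def build_pseudo_labels(teacher_detections, min_score=0.8):
    """Keep only confident teacher detections as training labels for the lightweight SSD."""
    dataset = []
    for image_id, dets in teacher_detections.items():
        kept = [d for d in dets if d["score"] >= min_score]
        if kept:  # discard images with no confident detections
            dataset.append({"image_id": image_id,
                            "boxes": [d["box"] for d in kept],
                            "classes": [d["class"] for d in kept]})
    return dataset
```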
{"title":"Using Robust Networks to Inform Lightweight Models in Semi-Supervised Learning for Object Detection","authors":"Jonathan Worobey, S. Recker, C. Gribble","doi":"10.1109/AIPR47015.2019.9174592","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174592","url":null,"abstract":"A common trade-off among object detection algorithms is accuracy-for-speed (or vice versa). To meet our application’s real-time requirement, we use a Single Shot MultiBox Detector (SSD) model. This architecture meets our latency requirements; however, a large amount of training data is required to achieve an acceptable accuracy level. While unusable for our end application, more robust network architectures, such as Regions with CNN features (R-CNN), provide an important advantage over SSD models—they can be more reliably trained on small datasets. By fine-tuning R-CNN models on a small number of hand-labeled examples, we create new, larger training datasets by running inference on the remaining unlabeled data. We show that these new, inferenced labels are beneficial to the training of lightweight models. These inferenced datasets are imperfect, and we explore various methods of dealing with the errors, including hand-labeling mislabeled data, discarding poor examples, and simply ignoring errors. Further, we explore the total cost, measured in human and computer time, required to execute this workflow compared to a hand-labeling baseline.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121040075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating the Population of Large Animals in the Wild Using Satellite Imagery: A Case Study of Hippos in Zambia’s Luangwa River
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174564
J. Irvine, J. Nolan, Nathaniel Hofmann, D. Lewis, Twakundine Simpamba, P. Zyambo, A. Travis, S. Hemami
Degradation of natural ecosystems, driven by increasing human activity and climate change, is threatening many animal populations in the wild. Zambia’s hippo population in the Luangwa Valley is one example, where declining forest cover from increased farming pressure has the potential to limit hippo range and numbers by reducing water flow in this population’s critical habitat, the Luangwa River. COMACO applies economic incentives through a farmer-based business model to mitigate threats of watershed loss and has identified hippos as a key indicator species for assessing its work and the health of Luangwa’s watershed. The goal of this effort is to develop automated machine learning tools that can process fine-resolution commercial satellite imagery to estimate the hippo population and associated characteristics of the habitat. The focus is the Luangwa River in Zambia, where the ideal time for imagery acquisition is the dry season of June through September. This study leverages historical commercial satellite imagery to identify selected areas with observable hippo groupings, develop an image-based signature for hippo detection, and construct an initial image classifier to support larger-scale assessment of the hippo population over broad regions. We begin by characterizing the nature of the problem and the challenges inherent in applying remote sensing methods to the estimation of animal populations. To address these challenges, spectral signatures were constructed from analysis of historical imagery. The initial approach to classifier development relied on spectral angle to distinguish hippos from background, where background conditions included water, bare soil, low vegetation, trees, and mixtures of these materials. We present the approach and the initial classifier results. We conclude with a discussion of next steps toward producing an image-based estimate of the hippo population and lessons learned from this study.
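A minimal sketch of the spectral-angle test used to separate candidate hippo pixels from background, assuming `reference` is the signature derived from historical imagery; the angle threshold is an illustrative placeholder, not a value from the study.

```python
import numpy as np

def spectral_angle(pixels, reference):
    """Angle (radians) between each pixel spectrum and the reference signature."""
    pixels = np.asarray(pixels, dtype=float)    # shape (N, bands)
    ref = np.asarray(reference, dtype=float)    # shape (bands,)
    cos = pixels @ ref / (np.linalg.norm(pixels, axis=1) * np.linalg.norm(ref))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def classify_hippo(pixels, reference, max_angle=0.10):
    # True where the pixel spectrum is close (in angle) to the hippo signature.
    return spectral_angle(pixels, reference) < max_angle
```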
{"title":"Estimating the Population of Large Animals in the Wild Using Satellite Imagery: A Case Study of Hippos in Zambia’s Luangwa River","authors":"J. Irvine, J. Nolan, Nathaniel Hofmann, D. Lewis, Twakundine Simpamba, P. Zyambo, A. Travis, S. Hemami","doi":"10.1109/AIPR47015.2019.9174564","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174564","url":null,"abstract":"Degradation of natural ecosystems as influenced by increasing human activity and climate change is threatening many animal populations in the wild. Zambia’s hippo population in Luangwa Valley is one example where declining forest cover from increased farming pressures has the potential of limiting hippo range and numbers by reducing water flow in this population’s critical habitat, the Luangwa River. COMACO applies economic incentives through a farmer-based business model to mitigate threats of watershed loss and has identified hippos as a key indicator species for assessing its work and the health of Luangwa’s watershed. The goal of this effort is to develop automated machine learning tools that can process fine resolution commercial satellite imagery to estimate the hippo population and associated characteristics of the habitat. The focus is the Luangwa River in Zambia, where the ideal time for imagery acquisition is the dry season of June through September. This study leverages historical commercial satellite imagery to identify selected areas with observable hippo groupings, develop an-image-based signature for hippo detection, and construct an initial image classifier to support larger-scale assessment of the hippo population over broad regions. We begin by characterizing the nature of the problem and the challenges inherent in applying remote sensing methods to the estimation of animal populations. To address these challenges, spectral signatures were constructed from analysis of historical imagery. The initial approach to classifier development relied on spectral angle to distinguish hippos from background, where background conditions included water, bare soil, low vegetation, trees, and mixtures of these materials. We present the approach and the initial classifier results. We conclude with a discussion of next steps to produce an imagebased estimate of the hippo populations and discuss lessons learned from this study.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125069484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3-D Scene Reconstruction Using Depth from Defocus and Deep Learning
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174568
David R. Emerson, Lauren A. Christopher
Depth estimation is becoming increasingly important in computer vision applications. As the commercial industry moves forward with autonomous vehicle research and development, there is a demand for these systems to be able to gauge their 3D surroundings in order to avoid obstacles and react to threats. This need requires depth estimation systems, and current research in self-driving vehicles now uses LIDAR for 3D awareness. However, as LIDAR becomes more prevalent, there is an increased risk of interference between this type of active measurement system on multiple vehicles. Passive methods, on the other hand, do not require the transmission of a signal in order to measure depth. Instead, they estimate depth using specific cues in the scene. Previous research, using a Depth from Defocus (DfD) single passive camera system, has shown that an in-focus image and an out-of-focus image can be used to produce a depth measure. This research introduces a new Deep Learning (DL) architecture that ingests these image pairs to produce a depth map of the given scene, improving both speed and performance over a range of lighting conditions. Compared to the previous state-of-the-art multi-label graph cut algorithms, the new DfD-Net produces a 63.7% and 33.6% improvement in Normalized Root Mean Square Error (NRMSE) for the darkest and brightest images, respectively. In addition to NRMSE, an image quality metric, the Structural Similarity Index (SSIM), was also used to assess DfD-Net performance. The DfD-Net produced a 3.6% increase (improvement) and a 2.3% reduction (slight decrease) in the SSIM metric for the darkest and brightest images, respectively.
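A minimal sketch of the two evaluation metrics quoted above (NRMSE and SSIM), assuming `truth` and `pred` are depth maps as float arrays of the same shape; this is only the scoring step, not the DfD-Net itself.

```python
import numpy as np
from skimage.metrics import normalized_root_mse, structural_similarity

def evaluate_depth(truth, pred):
    """Return (NRMSE, SSIM) for a predicted depth map against ground truth."""
    nrmse = normalized_root_mse(truth, pred)                 # lower is better
    ssim = structural_similarity(truth, pred,
                                 data_range=truth.max() - truth.min())  # higher is better
    return nrmse, ssim
```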
{"title":"3-D Scene Reconstruction Using Depth from Defocus and Deep Learning","authors":"David R. Emerson, Lauren A. Christopher","doi":"10.1109/AIPR47015.2019.9174568","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174568","url":null,"abstract":"Depth estimation is becoming increasingly important in computer vision applications. As the commercial industry moves forward with autonomous vehicle research and development, there is a demand for these systems to be able to gauge their 3D surroundings in order to avoid obstacles, and react to threats. This need requires depth estimation systems, and current research in self-driving vehicles now use LIDAR for 3D awareness. However, as LIDAR becomes more prevalent there is the potential for an increased risk of interference between this type of active measurement system on multiple vehicles. Passive methods, on the other hand, do not require the transmission of a signal in order to measure depth. Instead, they estimate the depth by using specific cues in the scene. Previous research, using a Depth from Defocus (DfD) single passive camera system, has shown that an in-focus image and an out-of-focus image can be used to produce a depth measure. This research introduces a new Deep Learning (DL) architecture that is capable of ingesting these image pairs to produce a depth map of the given scene improving both speed and performance over a range of lighting conditions. Compared to the previous state-of-the-art multi-label graph cut algorithms; the new DfD-Net produces a 63.7% and 33.6% improvement in the Normalized Root Mean Square Error (NRMSE) for the darkest and brightest images respectively. In addition to the NRMSE, an image quality metric (Structural Similarity Index (SSIM)) was also used to assess the DfD-Net performance. The DfD-Net produced a 3.6% increase (improvement) and a 2.3% reduction (slight decrease) in the SSIM metric for the darkest and brightest images respectively.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"02 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129631829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quantifying Socio-economic Context from Overhead Imagery
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174576
Brigid Angelini, Michael R. Crystal, J. Irvine
Discerning regional political volatility is valuable for successful policy development by government and commercial entities, and it requires an understanding of the underlying economic, social, and political environment. Some methods of obtaining this information, such as global public opinion surveys, are expensive and slow to complete. We explore the feasibility of gleaning comparable information through automated image processing, with a premium on freely available commercial satellite imagery. Previous work demonstrated success in predicting survey responses related to wealth, poverty, and crime in rural Afghanistan and Botswana by utilizing spatially coinciding high-resolution satellite images to develop models. We extend these findings by using similar image features to predict survey responses regarding political and economic sentiment. We also explore the feasibility of predicting survey responses with models built from Sentinel-2 satellite imagery, which is coarser in resolution but freely available. Our findings reiterate the potential for cheaply and quickly discerning the socio-politico-economic context of a region solely through satellite image features. We present a number of models and their cross-validated performance in predicting survey responses, and conclude with comments and recommendations for future work.
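A minimal sketch of the cross-validated modeling step, assuming `X` is a matrix of per-region image features and `y` the corresponding survey responses; the choice of a ridge regressor and of R^2 as the score is an illustrative assumption, not the paper's specific estimator.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def cv_score(X, y, folds=5):
    """Mean cross-validated R^2 of a linear model mapping image features to survey responses."""
    return cross_val_score(Ridge(alpha=1.0), X, y, cv=folds, scoring="r2").mean()
```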
{"title":"Quantifying Socio-economic Context from Overhead Imagery","authors":"Brigid Angelini, Michael R. Crystal, J. Irvine","doi":"10.1109/AIPR47015.2019.9174576","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174576","url":null,"abstract":"Discerning regional political volatility is valuable for successful policy development by government and commercial entities, and necessitates having an understanding of the underlying economic, social, and political environment. Some methods of obtaining the environment information, such as global public opinion surveys, are expensive and slow to complete. We explore the feasibility of gleaning comparable information through automated image processing with a premium on freely available commercial satellite imagery. Previous work demonstrated success in predicting survey responses related to wealth, poverty, and crime in rural Afghanistan and Botswana, by utilizing spatially coinciding high resolution satellite images to develop models. We extend these findings by using similar image features to predict survey responses regarding political and economic sentiment. We also explore the feasibility of predicting survey responses with models built from Sentinel 2 satellite imagery, which is coarser-resolution, but freely available. Our fidings reiterate the potential for cheaply and quickly discerning the socio-politico-economic context of a region solely through satellite image features. We show a number of models and their cross-validated performance in predicting survey responses, and conclude with comments and recommendations for future work.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122272016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PSIG-GAN: A Parameterized Synthetic Image Generator Optimized via Non-Differentiable GAN
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174588
Hussain I. Khajanchi, Jake Bezold, M. Kilcher, Alexander Benasutti, Brian Rentsch, Larry Pearlstein, S. Maxwell
Deep convolutional neural networks have been successfully deployed by large, well-funded teams, but their wider adoption is often limited by the cost and schedule ramifications of their requirement for massive amounts of labeled data. We address this problem through the use of a parameterized synthetic image generator. Our approach is particularly novel in that we are able to fine-tune the generator’s parameters through the use of a generative adversarial network. We describe our approach and present results that demonstrate its potential benefits. We demonstrate the PSIG-GAN by creating images for training a DCNN to detect the existence and location of weeds in lawn grass.
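A minimal sketch of the core idea: because the synthetic-image generator's parameters are not differentiable, they can be tuned with a gradient-free search that scores each candidate by how well its rendered images fool a discriminator. The hooks `render_batch` and `discriminator_realism`, and the perturbation-based search itself, are hypothetical placeholders, not the paper's optimizer.

```python
import numpy as np

def tune_generator(init_params, render_batch, discriminator_realism,
                   iters=200, sigma=0.1, seed=0):
    """Gradient-free tuning of generator parameters using discriminator feedback."""
    rng = np.random.default_rng(seed)
    best = np.asarray(init_params, dtype=float)
    best_score = discriminator_realism(render_batch(best))      # mean "looks real" score
    for _ in range(iters):
        cand = best + rng.normal(0.0, sigma, size=best.shape)   # perturb the parameters
        score = discriminator_realism(render_batch(cand))
        if score > best_score:                                   # keep improvements only
            best, best_score = cand, score
    return best
```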
{"title":"PSIG-GAN: A Parameterized Synthetic Image Generator Optimized via Non-Differentiable GAN","authors":"Hussain I. Khajanchi, Jake Bezold, M. Kilcher, Alexander Benasutti, Brian Rentsch, Larry Pearlstein, S. Maxwell","doi":"10.1109/AIPR47015.2019.9174588","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174588","url":null,"abstract":"Deep convolutional neural networks have been successfully deployed by large, well-funded teams, but their wider adoption is often limited by the cost and schedule ramifications of their requirement for massive amounts of labeled data. We address this problem through the use of a parameterized synthetic image generator. Our approach is particularly novel in that we have been able to fine tune the generator’s parameters through the use of a generative adversarial network. We describe our approach, and present results that demonstrate its potential benefits. We demonstrate the PSIG-GAN by creating images for training a DCNN to detect the existence and location of weeds in lawn grass.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131671168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Globally-scalable Automated Target Recognition (GATR)
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174585
Gary Chern, A. Groener, Michael Harner, Tyler Kuhns, A. Lam, Stephen O’Neill, M. D. Pritt
GATR (Globally-scalable Automated Target Recognition) is a Lockheed Martin software system for real-time object detection and classification in satellite imagery on a worldwide basis. GATR uses GPU-accelerated deep learning software to quickly search large geographic regions. On a single GPU it processes imagery at a rate of over 16 km²/sec (or more than 10 Mpixels/sec), and it requires only two hours to search the entire state of Pennsylvania for gas fracking wells. The search time scales linearly with the geographic area, and the processing rate scales linearly with the number of GPUs. GATR has a modular, cloud-based architecture that uses Maxar’s GBDX platform and provides an ATR analytic as a service. Applications include broad area search, watch boxes for monitoring ports and airfields, and site characterization. ATR is performed by deep learning models including RetinaNet and Faster R-CNN. Results are presented for the detection of aircraft and fracking wells and show that recall exceeds 90% even in geographic regions never seen before. GATR is extensible to new targets, such as cars and ships, and it also handles radar and infrared imagery.
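A quick sanity check on the quoted throughput: at roughly 16 km²/sec on one GPU, covering Pennsylvania (about 119,280 km², an external figure not taken from the paper) works out to approximately two hours, consistent with the abstract.

```python
area_km2 = 119_280           # approximate area of Pennsylvania (assumed figure)
rate_km2_per_s = 16          # single-GPU processing rate quoted in the abstract
hours = area_km2 / rate_km2_per_s / 3600
print(f"{hours:.2f} hours")  # ~2.07 hours
```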
{"title":"Globally-scalable Automated Target Recognition (GATR)","authors":"Gary Chern, A. Groener, Michael Harner, Tyler Kuhns, A. Lam, Stephen O’Neill, M. D. Pritt","doi":"10.1109/AIPR47015.2019.9174585","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174585","url":null,"abstract":"GATR (Globally-scalable Automated Target Recognition) is a Lockheed Martin software system for real-time object detection and classification in satellite imagery on a worldwide basis. GATR uses GPU-accelerated deep learning software to quickly search large geographic regions. On a single GPU it processes imagery at a rate of over 16 km2/sec (or more than 10 Mpixels/sec), and it requires only two hours to search the entire state of Pennsylvania for gas fracking wells. The search time scales linearly with the geographic area, and the processing rate scales linearly with the number of GPUs. GATR has a modular, cloud-based architecture that uses Maxar’s GBDX platform and provides an ATR analytic as a service. Applications include broad area search, watch boxes for monitoring ports and airfields, and site characterization. ATR is performed by deep learning models including RetinaNet and Faster R-CNN. Results are presented for the detection of aircraft and fracking wells and show that the recalls exceed 90% even in geographic regions never seen before. GATR is extensible to new targets, such as cars and ships, and it also handles radar and infrared imagery.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125091403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLSNet: Global and Local Streams Network for 3D Point Cloud Classification
Pub Date: 2019-10-01 | DOI: 10.1109/AIPR47015.2019.9174587
Rina Bao, K. Palaniappan, Yunxin Zhao, G. Seetharaman, Wenjun Zeng
We propose a novel deep architecture for semantic labeling of 3D point clouds, referred to as the Global and Local Streams Network (GLSNet), which is designed to capture both global and local structure and contextual information for large-scale 3D point cloud classification. GLSNet tackles a hard problem: the large differences in object size in large-scale point cloud segmentation, from extremely large objects like water to small objects like buildings and trees. We design a two-branch deep network architecture that decomposes this complex problem into separate processing problems at global and local scales and then fuses their predictions. GLSNet combines the strength of the Submanifold Sparse Convolutional Network [1] for learning global structure with the strength of PointNet++ [2] for incorporating local information. The first branch of GLSNet processes a full point cloud in the global stream, capturing long-range information about the geometric structure by using a U-Net structured Submanifold Sparse Convolutional Network (SSCN-U) architecture. The second branch of GLSNet processes a point cloud in the local stream; it partitions the 3D points into slices and processes one slice at a time using the PointNet++ architecture. The two streams of information are fused by max pooling over their classification prediction vectors. Our results on the IEEE GRSS Data Fusion Contest Urban Semantic 3D, Track 4 (DFT4) [3] [4] [5] point cloud classification dataset show that GLSNet achieved performance gains of almost 4% in mIOU and 1% in overall accuracy over the individual streams on the held-back testing dataset.
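A minimal sketch of the fusion step described above: per-point class predictions from the global (SSCN-U) and local (PointNet++) streams are combined by element-wise max pooling before taking the final label. The tensors here are placeholders for the two streams' outputs; shapes are assumed.

```python
import torch

def fuse_predictions(global_scores, local_scores):
    """Element-wise max over two (num_points, num_classes) prediction tensors."""
    fused = torch.maximum(global_scores, local_scores)  # max pooling across streams
    return fused.argmax(dim=1)                          # final per-point class labels
```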
{"title":"GLSNet: Global and Local Streams Network for 3D Point Cloud Classification","authors":"Rina Bao, K. Palaniappan, Yunxin Zhao, G. Seetharaman, Wenjun Zeng","doi":"10.1109/AIPR47015.2019.9174587","DOIUrl":"https://doi.org/10.1109/AIPR47015.2019.9174587","url":null,"abstract":"We propose a novel deep architecture for semantic labeling of 3D point clouds referred to as Global and Local Streams Network (GLSNet) which is designed to capture both global and local structures and contextual information for large scale 3D point cloud classification. Our GLSNet tackles a hard problem – large differences of object sizes in large-scale point cloud segmentation including extremely large objects like water, and small objects like buildings and trees, and we design a two-branch deep network architecture to decompose the complex problem to separate processing problems at global and local scales and then fuse their predictions. GLSNet combines the strength of Submanifold Sparse Convolutional Network [1] for learning global structure with the strength of PointNet++ [2] for incorporating local information.The first branch of GLSNet processes a full point cloud in the global stream, and it captures long range information about the geometric structure by using a U-Net structured Submanifold Sparse Convolutional Network (SSCN-U) architecture. The second branch of GLSNet processes a point cloud in the local stream, and it partitions 3D points into slices and processes one slice of the cloud at a time by using the PointNet ++ architecture. The two streams of information are fused by max pooling over their classification prediction vectors. Our results on the IEEE GRSS Data Fusion Contest Urban Semantic 3D, Track 4 (DFT4) [3] [4] [5] point cloud classification dataset have shown that GLSNet achieved performance gains of almost 4% in mIOU and 1% in overall accuracy over the individual streams on the held-back testing dataset.","PeriodicalId":167075,"journal":{"name":"2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117088400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}