Randall T. Whitman, Michael B. Park, Sarah M. Ambrose, E. Hoel
Title: Spatial indexing and analytics on Hadoop
Venue: Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Published: 2014-11-04
DOI: 10.1145/2666310.2666387 (https://doi.org/10.1145/2666310.2666387)
Citations: 60
Abstract
The need to process extremely large volumes of spatial data effectively has led many organizations to adopt distributed processing frameworks. Hadoop is one such open-source framework that is enjoying widespread adoption. In this paper, we detail an approach to indexing and performing key analytics on spatial data persisted in HDFS. Our technique differs from other approaches in that it combines spatial indexing, data load balancing, and data clustering to optimize performance across the cluster. In addition, our index supports efficient random-access queries without requiring a MapReduce job: searching incurs neither a full table scan nor any MapReduce overhead. This facilitates large numbers of concurrent query executions. We also demonstrate how indexing and clustering improve the performance of range and k-NN queries on large real-world datasets. The performance analysis yields a number of observations on the behavior of spatial indexes and spatial queries in this distributed processing environment.