Towards Timely, Resource-Efficient Analyses Through Spatially-Aware Constructs within Spark
Daniel Rammer, S. Pallickara, S. Pallickara
2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), published 2020-12-01
DOI: https://doi.org/10.1109/UCC48980.2020.00024
Citations: 1
Abstract
Data volumes have grown substantially across several domains, and a majority of the generated data are geotagged. These data hold a wealth of information that can inform insights, planning, and decision-making. The proliferation of open-source analytical engines has democratized access to tools and processing frameworks for analyzing data. However, several of these engines do not include streamlined support for spatial data wrangling and processing. Here, we present our language-agnostic methodology for effective analyses over voluminous spatiotemporal datasets using Spark. In particular, we introduce support for spatial data processing within the foundational constructs underpinning the development of Spark programs: DataFrames, Datasets, and RDDs. Our empirical benchmarks demonstrate the suitability of our methodology; in contrast to alternative distributed spatial analytics frameworks, we achieve over 2x speed-up for spatial range queries. Our methodology also uses resources effectively, reducing disk I/O by a factor of 18, network I/O by 5 orders of magnitude, and peak memory utilization by 58% for the same set of analytic tasks.
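For readers unfamiliar with the term, a spatial range query selects the records whose coordinates fall inside a query region. The sketch below illustrates the idea in plain Python with an axis-aligned bounding box. It is not the authors' implementation and does not use Spark; the names `Observation`, `BoundingBox`, and `range_query` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    """A geotagged record; the field names are hypothetical."""
    lat: float
    lon: float
    value: float


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned query rectangle in degrees.

    Assumption: the box does not cross the antimeridian.
    """
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, obs: Observation) -> bool:
        # Inclusive bounds on both axes.
        return (self.min_lat <= obs.lat <= self.max_lat
                and self.min_lon <= obs.lon <= self.max_lon)


def range_query(records: Iterable[Observation], box: BoundingBox) -> List[Observation]:
    """Return the records that fall inside the box, preserving input order."""
    return [r for r in records if box.contains(r)]


# Illustrative data: two points in Colorado and one in California.
records = [
    Observation(40.57, -105.08, 1.0),
    Observation(39.74, -104.99, 2.0),
    Observation(34.05, -118.24, 3.0),
]
colorado = BoundingBox(37.0, -109.05, 41.0, -102.05)
matches = range_query(records, colorado)
```

In a distributed engine the same predicate would typically be applied as a filter over a partitioned collection, so that each partition is scanned independently; how the paper integrates this with DataFrames, Datasets, and RDDs is not described in the abstract.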