Klimatic: A Virtual Data Lake for Harvesting and Distribution of Geospatial Data

2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS). Pub Date: 2016-11-13. DOI: 10.1109/PDSW-DISCS.2016.9

Tyler J. Skluzacek, K. Chard, Ian T Foster

Citations: 21

Abstract

Many interesting geospatial datasets are publicly accessible on web sites and other online repositories. However, the sheer number of datasets and locations, plus a lack of support for cross-repository search, makes it difficult for researchers to discover and integrate relevant data. We describe here early results from a system, Klimatic, that aims to overcome these barriers to discovery and use by automating the tasks of crawling, indexing, integrating, and distributing geospatial data. Klimatic implements a scalable crawling and processing architecture that uses an elastic container-based model to locate and retrieve relevant datasets and to extract metadata from headers and within files to build a global index of known geospatial data. In so doing, we create an expansive geospatial virtual data lake that records the location, formats, and other characteristics of large numbers of geospatial datasets while also caching popular data subsets for rapid access. A flexible query interface allows users to request data that satisfy supplied type, spatial, temporal, and provider specifications; in processing such queries, the system uses interpolation and aggregation to combine data of different types, data formats, resolutions, and bounds. Klimatic has so far incorporated more than 10,000 datasets from over 120 sources and has been demonstrated to scale well with data size and query complexity.
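
The abstract describes two steps that a small example can make concrete: filtering an index of dataset records by type, spatial, temporal, and provider constraints, and aggregating gridded data to a coarser resolution so that datasets can be combined. The Python sketch below is illustrative only. The record fields, the `query` and `block_mean` functions, the provider names, and the example.org URLs are all assumptions made for this example, not Klimatic's actual schema or API, and block averaging stands in for whatever aggregation method the system uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Assumed bounding-box convention: (min_lon, min_lat, max_lon, max_lat).
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class DatasetRecord:
    """One index entry. Field names are hypothetical, not Klimatic's schema."""
    url: str
    provider: str
    variable: str
    bbox: BBox
    start: date
    end: date


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """Return True if two axis-aligned boxes share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def query(index: List[DatasetRecord],
          variable: Optional[str] = None,
          bbox: Optional[BBox] = None,
          start: Optional[date] = None,
          end: Optional[date] = None,
          provider: Optional[str] = None) -> List[DatasetRecord]:
    """Return records matching every constraint that was supplied."""
    results = []
    for rec in index:
        if variable is not None and rec.variable != variable:
            continue
        if provider is not None and rec.provider != provider:
            continue
        if bbox is not None and not bboxes_overlap(rec.bbox, bbox):
            continue
        # Keep records whose time span intersects [start, end].
        if start is not None and rec.end < start:
            continue
        if end is not None and rec.start > end:
            continue
        results.append(rec)
    return results


def block_mean(grid: List[List[float]], factor: int) -> List[List[float]]:
    """Coarsen a 2D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Hypothetical index entries for illustration only.
EXAMPLE_INDEX = [
    DatasetRecord("https://example.org/a.nc", "provider-a", "precipitation",
                  (-100.0, 30.0, -80.0, 45.0), date(2000, 1, 1), date(2010, 12, 31)),
    DatasetRecord("https://example.org/b.nc", "provider-b", "temperature",
                  (-95.0, 35.0, -85.0, 40.0), date(2005, 1, 1), date(2015, 12, 31)),
    DatasetRecord("https://example.org/c.nc", "provider-a", "precipitation",
                  (0.0, 40.0, 20.0, 55.0), date(1990, 1, 1), date(1999, 12, 31)),
]

if __name__ == "__main__":
    hits = query(EXAMPLE_INDEX, variable="precipitation",
                 bbox=(-90.0, 38.0, -85.0, 42.0),
                 start=date(2005, 1, 1), end=date(2006, 1, 1))
    for rec in hits:
        print(rec.url)
```

A linear scan is enough to show the filtering logic. An index covering thousands of datasets would more plausibly rely on a spatial index or a database, which this sketch does not attempt to model.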

Source venue

2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS)

Self-citation rate

0.00%

Number of publications