A mediation system for continuous spatial queries on a unified schema using Apache Spark

Impact Factor: 4.2 | Zone 3 (Earth Sciences) | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS | Journal: Big Earth Data | Pub Date: 2023-11-09 | DOI: 10.1080/20964471.2023.2275854
Thi Thu Trang Ngo, François Pinet, David Sarramia, Myoung-Ah Kang
Citations: 0

Abstract

Recent advances in big and streaming data systems have enabled real-time analysis of data generated by Internet of Things (IoT) systems and sensors in various domains. In this context, many applications require integrating data from several heterogeneous sources, either stream or static sources. Frameworks such as Apache Spark are able to integrate and process large datasets from different sources. However, these frameworks are hard to use when the data sources are heterogeneous and numerous. To address this issue, we propose a system based on mediation techniques for integrating stream and static data sources. The integration process of our system consists of three main steps: configuration, query expression and query execution. In the configuration step, an administrator designs a mediated schema and defines mappings between the mediated schema and local data sources. In the query expression step, users express queries using a customized SQL grammar on the mediated schema. Finally, our system rewrites the query into an optimized Spark application and submits the application to a Spark cluster. The results are continuously returned to users. Our experiments show that our optimizations can improve query execution time by up to one order of magnitude, making complex streaming and spatial data analysis more accessible.
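
To make the configuration and query rewriting steps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation or their SQL grammar: the `SourceMapping` structure, the `observation` relation, the source names `kafka_sensors` and `archive_csv`, and all column names are hypothetical, chosen only for illustration. The sketch unfolds a query on a mediated relation into one sub-query per local source and copies the filter into each branch, which is one common kind of rewriting optimization and may differ from those the paper evaluates. It only produces SQL text and does not submit anything to Spark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMapping:
    """Maps one local source onto a mediated relation (hypothetical structure)."""
    source: str    # local table or stream name (assumed)
    kind: str      # "stream" or "static"; informational only in this sketch
    columns: dict  # mediated column name -> local column expression


# Configuration step: a hypothetical mediated relation "observation"
# backed by two heterogeneous sources with different column names and units.
MAPPINGS = {
    "observation": [
        SourceMapping("kafka_sensors", "stream",
                      {"sensor_id": "id", "temp_c": "temperature",
                       "ts": "event_time"}),
        SourceMapping("archive_csv", "static",
                      {"sensor_id": "station", "temp_c": "(temp_f - 32) / 1.8",
                       "ts": "measured_at"}),
    ]
}


def rewrite(relation, columns, predicate=None):
    """Unfold a query on a mediated relation into one sub-query per source.

    predicate is an optional (mediated_column, operator, literal) triple.
    It is translated into each source's own column expression, so every
    branch filters its rows before the branches are combined.
    """
    branches = []
    for m in MAPPINGS[relation]:
        select_list = ", ".join(f"{m.columns[c]} AS {c}" for c in columns)
        sql = f"SELECT {select_list} FROM {m.source}"
        if predicate is not None:
            column, op, value = predicate
            sql += f" WHERE {m.columns[column]} {op} {value}"
        branches.append(sql)
    return " UNION ALL ".join(branches)


# Query expression step: the user refers only to mediated names.
query = rewrite("observation", ["sensor_id", "temp_c"], ("temp_c", ">", 30))
print(query)
```

The user never sees `temperature` or `temp_f`; the mappings hide the heterogeneity of the sources, which is the role the abstract assigns to the mediated schema. How a real system combines stream and static branches on a Spark cluster is outside the scope of this sketch.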
Source journal: Big Earth Data
Subject area: Earth and Planetary Sciences, Computers in Earth Sciences
CiteScore: 7.40
Self-citation rate: 10.00%
Publication volume: 60
Review time: 10 weeks
Latest articles in this journal
A dataset of lake level changes in China between 2002 and 2023 using multi-altimeter data
The first 10 m resolution thermokarst lake and pond dataset for the Lena Basin in the 2020 thawing season
A high-resolution dataset for lower atmospheric process studies over the Tibetan Plateau from 1981 to 2020
An application of 1D convolution and deep learning to remote sensing modelling of Secchi depth in the northern Adriatic Sea
A mediation system for continuous spatial queries on a unified schema using Apache Spark