FDQ: Advance Analytics Over Real Scientific Array Datasets

Roee Ebenstein, G. Agrawal, Jiali Wang, J. Boley, R. Kettimuthu
2018 IEEE 14th International Conference on e-Science (e-Science), pp. 453-463. Published 2018-10-01.
DOI: 10.1109/eScience.2018.00134 (https://doi.org/10.1109/eScience.2018.00134)
Citations: 1

Abstract

Scientific data is rapidly increasing not only in size but also in the complexity of the operations performed on it. Compared to the prevalent ad-hoc approaches, structured operators provide many benefits. In this paper, we introduce FDQ, an Analytical Functions Distributed Querying Engine intended for array data. Motivated by the needs of climate scientists in terms of both functionality and scalability, we make three major contributions. First, we introduce a new class of analytical querying: querying over windows where the planes that construct these windows are internally ordered. An example of this querying type is the MINUS analytical function introduced here, which supports querying over accumulative measurements with data resets. Second, we describe in detail memory management optimizations for efficient processing of analytical queries (and other structured operators) over large datasets. Last, we provide efficient methods to execute these queries in parallel, using a sectioned (tiled) approach. We evaluate our methods using real multi-dimensional climate datasets and show that they outperform existing approaches. When running locally (not in a distributed manner), we observed an average performance improvement of 538% compared to other engines for analytical calculations. We also show that the performance of our methods improves linearly with the provided computing resources (scale up and out).
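
To make "accumulative measurements with data resets" concrete, the sketch below de-accumulates a single running total (for example, accumulated precipitation at one grid cell) into per-step increments. It is only an illustration and not FDQ's implementation: the function name `minus_with_resets` is hypothetical, and the reset rule (a value smaller than its predecessor means the counter restarted from zero) is an assumption, since the abstract does not say how resets are identified.

```python
from typing import List, Sequence


def minus_with_resets(accumulated: Sequence[float]) -> List[float]:
    """Turn a running total into per-step increments.

    ASSUMPTION (not stated in the abstract): a reset is detected when a
    value is smaller than the one before it, and the counter is assumed
    to have restarted from zero at that step.
    """
    increments: List[float] = []
    previous = 0.0  # the series is assumed to start accumulating from zero
    for value in accumulated:
        if value < previous:
            # Assumed reset: everything in `value` accumulated after the restart.
            increments.append(value)
        else:
            # Ordinary step: the increment is the difference to the predecessor.
            increments.append(value - previous)
        previous = value
    return increments


# Running total that resets between the third and fourth steps.
series = [1.0, 3.0, 6.0, 2.0, 5.0]
print(minus_with_resets(series))  # [1.0, 2.0, 3.0, 2.0, 3.0]
```

Each output value depends only on its immediate predecessor along the ordered dimension, so series at different grid cells can be processed independently. That is one plausible reason such a computation fits a tiled, parallel execution scheme, although the abstract gives no details of how FDQ partitions the work.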