ArrayUDF: User-Defined Scientific Data Analysis on Arrays

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
Pub Date: 2017-06-26
DOI: 10.1145/3078597.3078599

Bin Dong, Kesheng Wu, S. Byna, Jialin Liu, Weijie Zhao, Florin Rusu


Citations: 27

Abstract

User-Defined Functions (UDFs) allow application programmers to specify analysis operations on data while leaving the data management tasks to the system. This general approach enables numerous custom analysis functions and is at the heart of modern Big Data systems. Even though the UDF mechanism can theoretically support arbitrary operations, a wide variety of common operations, such as computing the moving average of a time series or the vorticity of a fluid flow, are hard to express and slow to execute. Since these operations are traditionally performed on multi-dimensional arrays, we propose to extend the expressiveness of structural locality to support UDF operations on arrays. We further propose an in situ UDF mechanism, called ArrayUDF, to implement structural locality. ArrayUDF allows users to define computations on adjacent array cells without using join operations, and it executes the UDF directly on arrays stored in data files without first loading their content into a data management system. Additionally, we present a thorough theoretical analysis of the data access cost of exploiting structural locality, which enables ArrayUDF to automatically select the best array partitioning strategy for a given UDF operation. In a series of performance evaluations on large scientific datasets, we have observed that, using the generic UDF interface, ArrayUDF consistently outperforms Spark, SciDB, and RasDaMan.
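To make the idea of defining a computation on adjacent array cells without joins more concrete, here is a minimal sketch in Python. It is not ArrayUDF's actual interface: the names `Stencil`, `apply_udf`, and `moving_average` are hypothetical, the array is a plain in-memory list rather than a data file, and clamping at the boundary is an assumption made only for this illustration.

```python
from typing import Callable, List, Sequence


class Stencil:
    """Hypothetical view of one array cell plus its neighbors.

    Neighbors are addressed by relative offset, so the UDF never needs
    a join to bring adjacent values together.
    """

    def __init__(self, data: Sequence[float], index: int) -> None:
        self._data = data
        self._index = index

    def __call__(self, offset: int = 0) -> float:
        # Assumption for this sketch: clamp at the array boundary so that
        # edge cells reuse the nearest valid value.
        i = min(max(self._index + offset, 0), len(self._data) - 1)
        return self._data[i]


def apply_udf(data: Sequence[float],
              udf: Callable[[Stencil], float]) -> List[float]:
    """Run the user-defined function once per cell and collect the results."""
    return [udf(Stencil(data, i)) for i in range(len(data))]


def moving_average(s: Stencil) -> float:
    """Three-point moving average written in terms of relative offsets."""
    return (s(-1) + s(0) + s(1)) / 3.0


series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = apply_udf(series, moving_average)
```

The point of the sketch is only the programming model: the user writes `moving_average` against relative offsets, and the runtime (here the trivial `apply_udf`) decides how cells are visited. Partitioning, parallel execution, and reading from files are deliberately left out.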
