Big Data Analytics Integrating a Parallel Columnar DBMS and the R Language

Yiqun Zhang, C. Ordonez, Wellington Cabrera
{"title":"Big Data Analytics Integrating a Parallel Columnar DBMS and the R Language","authors":"Yiqun Zhang, C. Ordonez, Wellington Cabrera","doi":"10.1109/CCGrid.2016.94","DOIUrl":null,"url":null,"abstract":"Most research has proposed scalable and parallel analytic algorithms that work outside a DBMS. On the other hand, R has become a very popular system to perform machine learning analysis, but it is limited by main memory and single-threaded processing. Recently, novel columnar DBMSs have shown to provide orders of magnitude improvement in SQL query processing speed, preserving the parallel speedup of row-based parallel DBMSs. With that motivation in mind, we present COLUMNAR, a system integrating a parallel columnar DBMS and R, that can directly compute models on large data sets stored as relational tables. Our algorithms are based on a combination of SQL queries, user-defined functions (UDFs) and R calls, where SQL queries and UDFs compute data set summaries that are sent to R to compute models in RAM. Since our hybrid algorithms exploit the DBMS for the most demanding computations involving the data set, they show linear scalability and are highly parallel. Our algorithms generally require one pass on the data set or a few passes otherwise (i.e. fewer passes than traditional methods). Our system can analyze data sets faster than R even when they fit in RAM and it also eliminates memory limitations in R when data sets exceed RAM size. On the other hand, it is an order of magnitude faster than Spark (a prominent Hadoop system) and a traditional row-based DBMS.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2016.94","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Most research has proposed scalable and parallel analytic algorithms that work outside a DBMS. On the other hand, R has become a very popular system to perform machine learning analysis, but it is limited by main memory and single-threaded processing. Recently, novel columnar DBMSs have shown to provide orders of magnitude improvement in SQL query processing speed, preserving the parallel speedup of row-based parallel DBMSs. With that motivation in mind, we present COLUMNAR, a system integrating a parallel columnar DBMS and R, that can directly compute models on large data sets stored as relational tables. Our algorithms are based on a combination of SQL queries, user-defined functions (UDFs) and R calls, where SQL queries and UDFs compute data set summaries that are sent to R to compute models in RAM. Since our hybrid algorithms exploit the DBMS for the most demanding computations involving the data set, they show linear scalability and are highly parallel. Our algorithms generally require one pass on the data set or a few passes otherwise (i.e. fewer passes than traditional methods). Our system can analyze data sets faster than R even when they fit in RAM and it also eliminates memory limitations in R when data sets exceed RAM size. On the other hand, it is an order of magnitude faster than Spark (a prominent Hadoop system) and a traditional row-based DBMS.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
集成并行列式DBMS和R语言的大数据分析
大多数研究都提出了在DBMS之外工作的可扩展和并行分析算法。另一方面,R已经成为一种非常流行的执行机器学习分析的系统,但它受到主内存和单线程处理的限制。最近,新的列式dbms显示出在SQL查询处理速度方面提供数量级的改进,同时保留了基于行的并行dbms的并行加速。考虑到这一动机,我们提出了COLUMNAR,这是一个集成了并行列式DBMS和R的系统,它可以直接在存储为关系表的大型数据集上计算模型。我们的算法基于SQL查询、用户定义函数(udf)和R调用的组合,其中SQL查询和udf计算数据集摘要,这些摘要发送给R以在RAM中计算模型。由于我们的混合算法利用DBMS进行涉及数据集的最苛刻的计算,因此它们显示出线性可伸缩性并且高度并行。我们的算法通常需要对数据集进行一次传递或几次传递(即比传统方法更少的传递)。我们的系统可以比R更快地分析数据集,即使它们适合RAM,并且当数据集超过RAM大小时,它还消除了R中的内存限制。另一方面,它比Spark(一个著名的Hadoop系统)和传统的基于行的DBMS要快一个数量级。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Increasing the Performance of Data Centers by Combining Remote GPU Virtualization with Slurm DiBA: Distributed Power Budget Allocation for Large-Scale Computing Clusters Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era DTStorage: Dynamic Tape-Based Storage for Cost-Effective and Highly-Available Streaming Service Facilitating the Execution of HPC Workloads in Colombia through the Integration of a Private IaaS and a Scientific PaaS/SaaS Marketplace
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1