SystemML中可扩展和数值稳定的描述性统计

2012 IEEE 28th International Conference on Data Engineering Pub Date : 2012-04-01 DOI:10.1109/ICDE.2012.12

Yuanyuan Tian, S. Tatikonda, B. Reinwald

{"title":"SystemML中可扩展和数值稳定的描述性统计","authors":"Yuanyuan Tian, S. Tatikonda, B. Reinwald","doi":"10.1109/ICDE.2012.12","DOIUrl":null,"url":null,"abstract":"With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome the two major challenges: (1) achieving numerical stability while operating on large data sets in a distributed setting of MapReduce, and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. We also highlight the valuable lessons we have learned in this exercise.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"Scalable and Numerically Stable Descriptive Statistics in SystemML\",\"authors\":\"Yuanyuan Tian, S. Tatikonda, B. Reinwald\",\"doi\":\"10.1109/ICDE.2012.12\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome the two major challenges: (1) achieving numerical stability while operating on large data sets in a distributed setting of MapReduce, and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. We also highlight the valuable lessons we have learned in this exercise.\",\"PeriodicalId\":321608,\"journal\":{\"name\":\"2012 IEEE 28th International Conference on Data Engineering\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 28th International Conference on Data Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2012.12\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 28th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2012.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 21

摘要

随着近年来产生的数据量呈指数级增长，迫切需要将机器学习算法应用于大型数据集。SystemML是一个采用声明式方法进行大规模数据分析的框架。在SystemML中，机器学习算法被表示为一种高级语言的脚本，称为DML，在语法上类似于r。DML脚本在构建在MapReduce之上的SystemML运行时中进行编译、优化和执行。作为几乎所有定量分析的基础，描述性统计提供了在SystemML中探索数据的强大工具。在本文中，我们描述了在SystemML中实现描述性统计的经验。特别是，我们详细阐述了如何克服两个主要挑战:(1)在MapReduce的分布式设置中处理大型数据集时实现数值稳定性;(2)设计可扩展的算法来计算MapReduce中的顺序统计。通过与现有工具和系统中常用的算法进行经验比较，我们证明了SystemML实现的数值精度。我们还强调我们在这项工作中学到的宝贵经验。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Scalable and Numerically Stable Descriptive Statistics in SystemML

With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale data analytics. In SystemML, machine learning algorithms are expressed as scripts in a high-level language, called DML, which is syntactically similar to R. DML scripts are compiled, optimized, and executed in the SystemML runtime that is built on top of MapReduce. As the basis of virtually every quantitative analysis, descriptive statistics provide powerful tools to explore data in SystemML. In this paper, we describe our experience in implementing descriptive statistics in SystemML. In particular, we elaborate on how to overcome the two major challenges: (1) achieving numerical stability while operating on large data sets in a distributed setting of MapReduce, and (2) designing scalable algorithms to compute order statistics in MapReduce. By empirically comparing to algorithms commonly used in existing tools and systems, we demonstrate the numerical accuracy achieved by SystemML. We also highlight the valuable lessons we have learned in this exercise.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE 28th International Conference on Data Engineering

自引率

0.00%

发文量

期刊最新文献

Keyword Query Reformulation on Structured Data Accuracy-Aware Uncertain Stream Databases Extracting Analyzing and Visualizing Triangle K-Core Motifs within Networks Project Daytona: Data Analytics as a Cloud Service Automatic Extraction of Structured Web Data with Domain Knowledge