Conquering big data with Spark and BDAS

Measurement and Modeling of Computer Systems. Pub Date: 2014-06-16. DOI: 10.1145/2637364.2611389

I. Stoica


Citations: 10

Abstract

Today, organizations large and small collect huge amounts of data with one goal in mind: to extract "value" through sophisticated exploratory analysis and use it as the basis for decisions as varied as personalized treatment and ad targeting. Unfortunately, existing data analytics tools are slow to answer queries, as they typically need to sift through huge amounts of data stored on disk, and they are even less suitable for complex computations such as machine learning algorithms. These limitations leave the potential of extracting value from big data unfulfilled.

To address this challenge, we are developing the Berkeley Data Analytics Stack (BDAS), an open source data analytics stack that provides interactive response times for complex computations on massive data. To achieve this goal, BDAS supports efficient, large-scale in-memory data processing and allows users and applications to trade off query accuracy, time, and cost. In this talk, I'll present the architecture, challenges, results, and our experience with developing BDAS, with a focus on Apache Spark, an in-memory cluster computing engine that supports a variety of workloads, including batch, streaming, and iterative computations. In a relatively short time, Spark has become the most active big data project in the open source community and is already being used by over one hundred companies and research institutions.
