Taming Big Data: Integrating diverse public data sources for economic competitiveness analytics
R. Neamtu, Ramoza Ahsan, J. Stokes, Armend Hoxha, Jialiang Bao, Stefan Gvozdenovic, Ted Meyer, Nilesh Patel, Raghu Rangan, Yumou Wang, Dongyun Zhang, Elke A. Rundensteiner
In an era where Big Data can greatly impact a broad population, many novel opportunities arise, chief among them the ability to integrate data from diverse sources and "wrangle" it to extract novel insights. Conceived as a tool that helps both expert and non-expert users better understand public data, MATTERS was collaboratively developed by the Massachusetts High Tech Council, WPI and other institutions as an analytic platform offering dynamic modeling capabilities. MATTERS is an integrative data source of high-fidelity cost and talent competitiveness metrics. Its goal is to extract, integrate and model rich economic, financial, educational and technological information from renowned heterogeneous web data sources, ranging from the US Census Bureau and the Bureau of Labor Statistics to the Institute of Education Sciences, all known to be critical factors influencing the economic competitiveness of states. This demonstration of MATTERS illustrates how we tackle the challenges of data acquisition, cleaning, integration and wrangling into appropriate representations, visualization, and storytelling with data in the context of state competitiveness in the high-tech sector.
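As a rough illustration of the kind of integration step described above, the following Python sketch merges two hypothetical state-level metric extracts into a single table keyed by state. The file names, column names, and normalization rule are assumptions for illustration only, not part of the MATTERS implementation.

```python
import csv
from collections import defaultdict

def load_metric(path, state_col, value_col):
    """Read one hypothetical source extract into a {state: float} map."""
    values = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            state = row[state_col].strip().upper()        # normalize the join key
            try:
                values[state] = float(row[value_col].replace(",", ""))
            except (ValueError, TypeError):
                continue                                  # skip unparseable cells
    return values

def integrate(sources):
    """Combine several named {state: value} maps into one row per state."""
    merged = defaultdict(dict)
    for metric_name, per_state in sources.items():
        for state, value in per_state.items():
            merged[state][metric_name] = value
    return dict(merged)

# Hypothetical extracts standing in for BLS and IES downloads.
wages = load_metric("bls_wages.csv", "State", "AnnualMeanWage")
grads = load_metric("ies_stem_grads.csv", "state_name", "stem_graduates")
states = integrate({"mean_wage": wages, "stem_graduates": grads})
```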
{"title":"Taming Big Data: Integrating diverse public data sources for economic competitiveness analytics","authors":"R. Neamtu, Ramoza Ahsan, J. Stokes, Armend Hoxha, Jialiang Bao, Stefan Gvozdenovic, Ted Meyer, Nilesh Patel, Raghu Rangan, Yumou Wang, Dongyun Zhang, Elke A. Rundensteiner","doi":"10.1145/2658840.2658845","DOIUrl":"https://doi.org/10.1145/2658840.2658845","url":null,"abstract":"In an era where Big Data can greatly impact a broad population, many novel opportunities arise, chief among them the ability to integrate data from diverse sources and \"wrangle\" it to extract novel insights. Conceived as a tool that can help both expert and non-expert users better understand public data, MATTERS was collaboratively developed by the Massachusetts High Tech Council, WPI and other institutions as an analytic platform offering dynamic modeling capabilities. MATTERS is an integrative data source on high fidelity cost and talent competitiveness metrics. Its goal is to extract, integrate and model rich economic, financial, educational and technological information from renowned heterogeneous web data sources ranging from The US Census Bureau, The Bureau of Labor Statistics to the Institute of Education Sciences, all known to be critical factors influencing economic competitiveness of states. This demonstration of MATTERS illustrates how we tackle challenges of data acquisition, cleaning, integration and wrangling into appropriate representations, visualization and story-telling with data in the context of state competitiveness in the high-tech sector.","PeriodicalId":135661,"journal":{"name":"Data4U '14","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134543287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Paradigm for Learning Queries on Big Data
A. Bonifati, Radu Ciucanu, Aurélien Lemay, S. Staworko
Specifying a database query using a formal query language is typically a challenging task for non-expert users. In the context of big data, this problem becomes even harder, as it requires users to deal with database instances so large that they are difficult to visualize. Such instances usually lack a schema that could help users specify their queries, or have an incomplete schema because they come from disparate data sources. In this paper, we propose a novel paradigm for interactive learning of queries on big data, without assuming any knowledge of the database schema. The paradigm can be applied to different database models, each paired with a class of queries adequate to that model. In particular, we present two instantiations that validate the proposed paradigm: learning relational join queries and learning path queries on graph databases. Finally, we discuss the challenges of employing the paradigm for further data models and for learning cross-model schema mappings.
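The following toy Python sketch is not the authors' algorithm, but it illustrates the interaction pattern the paradigm relies on for join queries: the user labels a few tuple pairs as wanted or unwanted, and the system narrows down the set of equality join predicates consistent with those labels.

```python
from itertools import product

def consistent_join_predicates(r, s, labeled_pairs):
    """Keep equality join conditions (column i of R = column j of S) that agree
    with the user's labels: every positive pair must satisfy the condition and
    every negative pair must violate it."""
    candidates = set(product(range(len(r[0])), range(len(s[0]))))
    for (t_r, t_s), is_positive in labeled_pairs:
        satisfied = {(i, j) for (i, j) in candidates if t_r[i] == t_s[j]}
        candidates = satisfied if is_positive else candidates - satisfied
    return candidates

# Toy instance: R(id, city), S(person, city_of_birth).
R = [(1, "Lyon"), (2, "Lille")]
S = [("ana", "Lyon"), ("bob", "Paris")]
labels = [(((1, "Lyon"), ("ana", "Lyon")), True),
          (((2, "Lille"), ("bob", "Paris")), False)]
print(consistent_join_predicates(R, S, labels))   # {(1, 1)}: join on the city columns
```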
{"title":"A Paradigm for Learning Queries on Big Data","authors":"A. Bonifati, Radu Ciucanu, Aurélien Lemay, S. Staworko","doi":"10.1145/2658840.2658842","DOIUrl":"https://doi.org/10.1145/2658840.2658842","url":null,"abstract":"Specifying a database query using a formal query language is typically a challenging task for non-expert users. In the context of big data, this problem becomes even harder as it requires the users to deal with database instances of big sizes and hence difficult to visualize. Such instances usually lack a schema to help the users specify their queries, or have an incomplete schema as they come from disparate data sources. In this paper, we propose a novel paradigm for interactive learning of queries on big data, without assuming any knowledge of the database schema. The paradigm can be applied to different database models and a class of queries adequate to the database model. In particular, in this paper we present two instantiations that validated the proposed paradigm for learning relational join queries and for learning path queries on graph databases. Finally, we discuss the challenges of employing the paradigm for further data models and for learning cross-model schema mappings.","PeriodicalId":135661,"journal":{"name":"Data4U '14","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133167570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DiNoDB: Efficient Large-Scale Raw Data Analytics
Yongchao Tian, Ioannis Alagiannis, Erietta Liarou, A. Ailamaki, P. Michiardi, M. Vukolic
Modern big data workflows, found for example in machine learning use cases, often alternate between cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data. In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both scalable batch and interactive analytics on raw data, thus avoiding the loading-phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries over raw files. We revisit NoDB's centralized design and scale it out to support multiple clients and data processing nodes, producing a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overhead of data loading and data movement phases, allowing users to quickly and efficiently exploit their data. Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency compared to state-of-the-art distributed query engines such as Shark, Hive and HadoopDB.
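The sketch below is a single-machine, simplified illustration of the positional-map idea that DiNoDB builds on and distributes: one pass over a raw CSV file records the byte offsets of rows and fields, so later queries can seek directly to the attributes they need instead of re-parsing whole files. The helper names and the 256-byte read are assumptions for illustration, not DiNoDB's actual interface.

```python
def build_positional_map(path, delimiter=b","):
    """One pass over the raw file: record the byte offset of every row and of
    every field within it, so subsequent queries avoid re-parsing whole lines."""
    row_offsets, field_offsets = [], []
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            row_offsets.append(offset)
            cols, pos = [], offset
            for cell in line.rstrip(b"\n").split(delimiter):
                cols.append(pos)
                pos += len(cell) + 1              # +1 for the delimiter
            field_offsets.append(cols)
            offset += len(line)
    return row_offsets, field_offsets

def read_attribute(path, field_offsets, row, col, delimiter=b","):
    """Fetch one attribute by seeking to its recorded offset (field assumed < 256 bytes)."""
    with open(path, "rb") as f:
        f.seek(field_offsets[row][col])
        chunk = f.read(256).split(b"\n")[0]       # stop at the end of the row
        return chunk.split(delimiter)[0].decode()
```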
{"title":"DiNoDB: Efficient Large-Scale Raw Data Analytics","authors":"Yongchao Tian, Ioannis Alagiannis, Erietta Liarou, A. Ailamaki, P. Michiardi, M. Vukolic","doi":"10.1145/2658840.2658841","DOIUrl":"https://doi.org/10.1145/2658840.2658841","url":null,"abstract":"Modern big data workflows, found in e.g., machine learning use cases, often involve iterations of cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared nothing RDBMSs) require data loading and/or transformation phase, which is inherently expensive for temporary data.\u0000 In this paper, we propose a novel scalable distributed solution for in-situ data analytics, that offers both scalable batch and interactive data analytics on raw data, hence avoiding the loading phase bottleneck of RDBMSs. Our system combines a MapReduce based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries of raw files. We revisit the NoDB's centralized design and scale it out supporting multiple clients and data processing nodes to produce a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) to speed up interactive queries without the overheads of the data loading and data movement phases allowing users to quickly and efficiently exploit their data.\u0000 Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, like Shark, Hive and HadoopDB.","PeriodicalId":135661,"journal":{"name":"Data4U '14","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132591683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Processing of k-Dominant Skyline Query in MapReduce
Hao Tian, M. A. Siddique, Y. Morimoto
Filtering uninteresting data is important for utilizing "big data". The skyline query is a popular filtering technique: it selects from a large database the set of points that are not dominated by any other point. However, a skyline query often retrieves too many points to analyze intensively, especially for high-dimensional datasets. To address this problem, k-dominant skyline queries have been introduced, which can control the number of retrieved points. However, conventional algorithms for computing k-dominant skyline queries are not well suited for parallel and distributed environments such as the MapReduce framework. In this paper, we present an efficient parallel algorithm for processing k-dominant skyline queries in the MapReduce framework. Extensive experiments are conducted to evaluate the algorithm under different settings of data distribution, dimensionality, and cardinality.
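For concreteness, the Python sketch below spells out the k-dominance test (for minimization) and a naive single-node baseline for the k-dominant skyline; it does not reproduce the distributed MapReduce algorithm proposed in the paper.

```python
def k_dominates(p, q, k):
    """p k-dominates q (minimization) if p <= q in at least k dimensions
    and p < q in at least one of those dimensions."""
    better_or_equal = [i for i in range(len(p)) if p[i] <= q[i]]
    return len(better_or_equal) >= k and any(p[i] < q[i] for i in better_or_equal)

def k_dominant_skyline(points, k):
    """Naive quadratic baseline: keep points not k-dominated by any other point."""
    return [p for p in points
            if not any(k_dominates(q, p, k) for q in points if q != p)]

pts = [(1, 1, 2), (3, 4, 1), (4, 3, 5)]
print(k_dominant_skyline(pts, 2))                 # [(1, 1, 2)]
```

A MapReduce formulation would typically have mappers prune local candidates on their partitions and reducers verify the survivors against all points; the specific scheme and its optimizations are the paper's contribution and are not shown here.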
{"title":"An Efficient Processing of k-Dominant Skyline Query in MapReduce","authors":"Hao Tian, M. A. Siddique, Y. Morimoto","doi":"10.1145/2658840.2658846","DOIUrl":"https://doi.org/10.1145/2658840.2658846","url":null,"abstract":"Filtering uninteresting data is important to utilize \"big data\". Skyline query is one of popular techniques to filter uninteresting data, in which it selects a set of points that are not dominated by another from a given large database. However, a skyline query often retrieves too many points to analyze intensively especially for high-dimensional dataset. In order to solve the problem, k-dominant skyline queries have been introduced, which can control the number of retrieved points. However, conventional algorithms for computing k-dominant skyline queries are not well suited for parallel and distributed environments, such as the MapReduce framework. In this paper we considered an efficient parallel algorithm to process k-dominant skyline query in the MapReduce framework. Extensive experiments are conducted to evaluate the algorithm under different settings of data distribution, dimensionality, and cardinality.","PeriodicalId":135661,"journal":{"name":"Data4U '14","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125416345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Affordable Analytics on Expensive Data
P. Upadhyaya, Martina Unutzer, M. Balazinska, Dan Suciu, Hakan Hacıgümüş
In this paper, we outline steps towards supporting "data analysis on a budget" in settings where data must be bought, possibly periodically. We model the problem and explore the design choices for analytic applications as well as potentially fruitful algorithmic techniques for reducing the cost of acquiring data. Simulations suggest that order-of-magnitude improvements are possible.
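As a loose illustration of one cost-saving direction consistent with the setting above, the Python sketch below caches previously purchased tuples and pays only for what a new query still needs. The per-tuple pricing model and the class interface are assumptions made for illustration, not the techniques proposed in the paper.

```python
class BudgetedAcquirer:
    """Toy buyer: remember what has already been purchased and buy only the
    missing tuples for each new query, while tracking a spending budget."""

    def __init__(self, budget, price_per_tuple):
        self.budget = budget
        self.price = price_per_tuple
        self.cache = {}                               # key -> previously bought tuple

    def answer(self, requested_keys, purchase):
        """`purchase` is a callable that buys tuples by key from the data seller."""
        missing = [k for k in requested_keys if k not in self.cache]
        cost = len(missing) * self.price
        if cost > self.budget:
            raise RuntimeError("query would exceed the remaining budget")
        if missing:
            self.cache.update(purchase(missing))      # pay only for the delta
            self.budget -= cost
        return {k: self.cache[k] for k in requested_keys if k in self.cache}
```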
{"title":"Affordable Analytics on Expensive Data","authors":"P. Upadhyaya, Martina Unutzer, M. Balazinska, Dan Suciu, Hakan Hacıgümüş","doi":"10.1145/2658840.2658844","DOIUrl":"https://doi.org/10.1145/2658840.2658844","url":null,"abstract":"In this paper, we outline steps towards supporting \"data analysis on a budget\" when operating in a setting where data must be bought, possibly periodically. We model the problem, and explore the design choices for analytic applications as well as potentially fruitful algorithmic techniques to reduce the cost of acquiring data. Simulations suggest that an order of magnitude improvements are possible.","PeriodicalId":135661,"journal":{"name":"Data4U '14","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134056228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}