Data quality: The other face of Big Data

2014 IEEE 30th International Conference on Data Engineering
Pub Date: 2014-05-19
DOI: 10.1109/ICDE.2014.6816764

B. Saha, D. Srivastava


Citations: 217

Abstract


In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor-quality data is prevalent in large databases and on the Web. Since poor-quality data can have serious consequences for the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's (volume, velocity and variety) bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the data "speak for itself" in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on two major dimensions, (i) discovering quality issues from the data itself and (ii) trading off accuracy against efficiency, and identifies a range of open problems for the community.
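As a rough illustration of the two dimensions named in the abstract, the sketch below checks how well a candidate rule holds in the data and then repeats the check on a random sample. The abstract does not name any specific technique, so everything here is an assumption made for illustration: the choice of a functional dependency (`zip` determines `city`) as the rule, the toy rows, and the helper names `fd_violation_rate` and `sampled_fd_violation_rate`.

```python
import random
from collections import Counter, defaultdict


def fd_violation_rate(rows, lhs, rhs):
    """Fraction of rows that would have to be removed so that lhs -> rhs holds.

    Hypothetical helper: 0.0 means the dependency holds exactly on these rows.
    """
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Within each lhs group, keep the most frequent rhs value;
    # every other row in the group counts as a violation.
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return 1 - kept / len(rows)


def sampled_fd_violation_rate(rows, lhs, rhs, sample_size, seed=0):
    """Estimate the violation rate from a random sample (faster, less accurate)."""
    rng = random.Random(seed)
    if len(rows) <= sample_size:
        sample = rows
    else:
        sample = rng.sample(rows, sample_size)
    return fd_violation_rate(sample, lhs, rhs)


# Toy data (invented for this example): one misspelled city for zip 07932.
rows = [
    {"zip": "07932", "city": "Florham Park"},
    {"zip": "07932", "city": "Florham Park"},
    {"zip": "07932", "city": "Florham Pk"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
]

exact = fd_violation_rate(rows, "zip", "city")
estimate = sampled_fd_violation_rate(rows, "zip", "city", sample_size=3)
```

On the full toy table, one row in five conflicts with the majority value for its zip code, so the rule holds only approximately and the conflicting row is a candidate error. The sampled variant scans fewer rows at the cost of a noisier estimate, which is one simple form of the accuracy-versus-efficiency trade-off.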
