Clean Answers over Dirty Databases: A Probabilistic Approach

22nd International Conference on Data Engineering (ICDE'06), p. 30. Pub Date: 2006-04-03. DOI: 10.1109/ICDE.2006.35

Periklis Andritsos, A. Fuxman, Renée J. Miller


Citations: 202

Abstract

The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.
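
The idea of returning each answer together with its probability of being in the clean database can be illustrated with a small sketch. It is not the authors' rewriting algorithm, and it rests on assumptions the abstract does not spell out: duplicates are grouped into clusters, the probabilities within a cluster sum to 1, exactly one tuple per cluster belongs to the clean database, and clusters are independent. The table `customer`, its columns, and the data are hypothetical. Under these assumptions, a single-table selection that returns the cluster identifier can be answered with a grouped `SUM(prob)`, and the result can be checked by enumerating every candidate clean database.

```python
import itertools
import sqlite3
from collections import defaultdict

# Hypothetical dirty relation: (cluster_id, name, balance, prob).
# Tuples sharing a cluster_id are duplicates of one real-world entity;
# prob is the (assumed) probability that the tuple is the clean one.
ROWS = [
    ("c1", "John Smith", 20000, 0.7),
    ("c1", "J. Smith", 30000, 0.3),
    ("c2", "Mary Jones", 27000, 0.2),
    ("c2", "Marion Jones", 5000, 0.8),
    ("c3", "Ann Lee", 40000, 1.0),
]


def rewritten_query(rows, threshold):
    """Answer 'which entities have balance > threshold' with a grouped sum."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer "
        "(cluster_id TEXT, name TEXT, balance INTEGER, prob REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT cluster_id, SUM(prob) FROM customer "
        "WHERE balance > ? GROUP BY cluster_id",
        (threshold,),
    )
    result = dict(cur.fetchall())
    conn.close()
    return result


def brute_force(rows, threshold):
    """Enumerate every candidate clean database (one tuple per cluster)."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[0]].append(row)
    answer_prob = defaultdict(float)
    for candidate in itertools.product(*clusters.values()):
        # Probability of this candidate: product of its tuples' probabilities.
        p = 1.0
        for row in candidate:
            p *= row[3]
        # Credit p to every entity that satisfies the selection here.
        for cluster_id, _name, balance, _prob in candidate:
            if balance > threshold:
                answer_prob[cluster_id] += p
    return dict(answer_prob)


if __name__ == "__main__":
    fast = rewritten_query(ROWS, 25000)
    slow = brute_force(ROWS, 25000)
    for cluster_id in sorted(fast):
        print(cluster_id, round(fast[cluster_id], 6), round(slow[cluster_id], 6))
```

The grouped sum touches each tuple once, while the enumeration grows with the product of the cluster sizes, which is why pushing the probability computation into the query itself is attractive. Queries with joins need more care than this single-table case.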
