Clean Answers over Dirty Databases: A Probabilistic Approach

Periklis Andritsos, A. Fuxman, Renée J. Miller
{"title":"Clean Answers over Dirty Databases: A Probabilistic Approach","authors":"Periklis Andritsos, A. Fuxman, Renée J. Miller","doi":"10.1109/ICDE.2006.35","DOIUrl":null,"url":null,"abstract":"The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.","PeriodicalId":6819,"journal":{"name":"22nd International Conference on Data Engineering (ICDE'06)","volume":"20 5 1","pages":"30-30"},"PeriodicalIF":0.0000,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"202","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"22nd International Conference on Data Engineering (ICDE'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2006.35","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 202

Abstract

The detection of duplicate tuples, corresponding to the same real-world entity, is an important task in data integration and cleaning. While many techniques exist to identify such tuples, the merging or elimination of duplicates can be a difficult task that relies on ad-hoc and often manual solutions. We propose a complementary approach that permits declarative query answering over duplicated data, where each duplicate is associated with a probability of being in the clean database. We rewrite queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database. Our rewritten queries are sensitive to the semantics of duplication and help a user understand which query answers are most likely to be present in the clean database. The semantics that we adopt is independent of the way the probabilities are produced, but is able to effectively exploit them during query answering. In the absence of external knowledge that associates each database tuple with a probability, we offer a technique, based on tuple summaries, that automates this task. We experimentally study the performance of our rewritten queries. Our studies show that the rewriting does not introduce a significant overhead in query execution time. This work is done in the context of the ConQuer project at the University of Toronto, which focuses on the efficient management of inconsistent and dirty databases.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
干净的答案胜于肮脏的数据库:一种概率方法
在数据集成和清理中,检测对应于相同现实世界实体的重复元组是一项重要任务。虽然有许多技术可以识别这样的元组,但是合并或消除重复项可能是一项困难的任务,它依赖于特别的、通常是手动的解决方案。我们提出了一种补充方法,允许对重复数据进行声明性查询回答,其中每个重复数据都与在干净数据库中的概率相关联。我们重写对包含重复项的数据库的查询,以返回每个答案的概率为答案在干净的数据库中。我们重写的查询对重复的语义很敏感,并帮助用户理解哪些查询答案最有可能出现在干净的数据库中。我们采用的语义独立于概率产生的方式,但能够在查询回答期间有效地利用它们。在缺乏将每个数据库元组与概率关联起来的外部知识的情况下,我们提供了一种基于元组摘要的技术,可以自动执行此任务。我们通过实验研究了重写查询的性能。我们的研究表明,重写不会在查询执行时间上带来很大的开销。这项工作是在多伦多大学的ConQuer项目的上下文中完成的,该项目的重点是对不一致和脏数据库的有效管理。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
An Approach to Adaptive Memory Management in Data Stream Systems Revision Processing in a Stream Processing Engine: A High-Level Design SUBSKY: Efficient Computation of Skylines in Subspaces How to Determine a Good Multi-Programming Level for External Scheduling Warehousing and Analyzing Massive RFID Data Sets
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1