Ranking for data repairs

M. Yakout, A. Elmagarmid, Jennifer Neville
Published in: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), 2010-03-01
DOI: 10.1109/ICDEW.2010.5452767 (https://doi.org/10.1109/ICDEW.2010.5452767)
Citations: 10

Abstract

Improving data quality is a time-consuming, labor-intensive, and often domain-specific operation. A recent principled approach to repairing a dirty database is to use data quality rules, in the form of database constraints, to identify dirty tuples and then use the rules to derive data repairs. Most existing data repair approaches focus on providing fully automated solutions, which can be risky to depend on, especially for critical data. To guarantee that the repairs applied to the database are of the best quality, users should be involved to confirm each repair. This highlights the need for an interactive approach that combines the best of both: automatically generating repairs while efficiently employing the user's effort to verify them. In such an approach, the user guides an online repairing process that incrementally generates repairs. A key challenge is the response time within the user's interactive sessions, because generating repairs is time-consuming due to the large search space of possible repairs. To this end, we present a mechanism that continuously generates repairs only for the current top-k most important violated data quality rules. Moreover, the repairs are grouped and ranked so that the most beneficial ones, in terms of improving data quality, are presented to the user first for verification and feedback. Our experiments on a real-world dataset demonstrate the effectiveness of our ranking mechanism in providing a fast response time for the user while improving the data quality as quickly as possible.
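The two steps named in the abstract (restricting repair generation to the top-k violated rules, then grouping and ranking the candidate repairs by benefit) can be illustrated with a minimal sketch. The abstract does not give the paper's actual scoring or grouping functions, so everything below is an assumption made for illustration: the `Rule` and `Repair` types, the `weight * violations` importance score, the per-repair `gain` estimate, and grouping by `(attribute, new_value)` are all hypothetical.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    weight: float     # hypothetical importance weight of the rule
    violations: int   # number of tuples currently violating the rule


@dataclass(frozen=True)
class Repair:
    rule: str         # name of the rule whose violation this repair addresses
    tuple_id: int
    attribute: str
    new_value: str
    gain: float       # hypothetical estimate of the quality improvement


def top_k_rules(rules, k):
    """Return the k violated rules with the highest weight * violations score."""
    violated = [r for r in rules if r.violations > 0]
    return heapq.nlargest(k, violated, key=lambda r: r.weight * r.violations)


def rank_repair_groups(repairs, selected_rules):
    """Group repairs for the selected rules by (attribute, new_value),
    then sort the groups by total estimated gain, highest first."""
    selected = {r.name for r in selected_rules}
    groups = defaultdict(list)
    for rep in repairs:
        if rep.rule in selected:
            groups[(rep.attribute, rep.new_value)].append(rep)
    return sorted(
        groups.items(),
        key=lambda item: sum(r.gain for r in item[1]),
        reverse=True,
    )


# Illustrative data only; not taken from the paper's experiments.
rules = [
    Rule("zip->city", 1.0, 40),
    Rule("zip->state", 0.5, 10),
    Rule("phone->name", 2.0, 0),   # not violated, so never selected
]
repairs = [
    Repair("zip->city", 1, "city", "Lafayette", 0.9),
    Repair("zip->city", 2, "city", "Lafayette", 0.8),
    Repair("zip->city", 3, "city", "Chicago", 0.7),
    Repair("zip->state", 4, "state", "IN", 0.6),
]

selected = top_k_rules(rules, k=1)
ranked = rank_repair_groups(repairs, selected)
```

In an interactive loop, the first group in `ranked` would be shown to the user for confirmation, and the rule statistics would be recomputed after each round of feedback before the next top-k selection. How the paper itself updates scores after feedback is not stated in the abstract.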