A unified model for data and constraint repair
Fei Chiang, Renée J. Miller
2011 IEEE 27th International Conference on Data Engineering, 2011-04-11
DOI: 10.1109/ICDE.2011.5767833 (https://doi.org/10.1109/ICDE.2011.5767833)
Citations: 107
Abstract
Integrity constraints play an important role in data design. However, in an operational database, they may not be enforced for many reasons. Hence, over time, data may become inconsistent with respect to the constraints. To manage this, several approaches have proposed techniques to repair the data by finding minimal or lowest-cost changes to the data that make it consistent with the constraints. Such techniques are appropriate for the old world where data changes, but schemas and their constraints remain fixed. In many modern applications, however, constraints may evolve over time as application or business rules change, as data is integrated with new data sources, or as the underlying semantics of the data evolves. In such settings, when an inconsistency occurs, it is no longer clear whether there is an error in the data (and the data should be repaired) or whether the constraints have evolved (and the constraints should be repaired). In this work, we present a novel unified cost model that allows data and constraint repairs to be compared on an equal footing. We consider repairs over a database that is inconsistent with respect to a set of rules, modeled as functional dependencies (FDs). FDs are the most common type of constraint, and are known to play an important role in maintaining data quality. We evaluate the quality and scalability of our repair algorithms over synthetic data and present a qualitative case study using a well-known real dataset. The results show that our repair algorithms not only scale well to large datasets, but also accurately capture and correct inconsistencies, and accurately decide when a data repair versus a constraint repair is best.
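To make the data-versus-constraint choice concrete, the following is a minimal Python sketch of the two kinds of repair for a single violated FD. It is an illustration only, not the paper's cost model or algorithms: the table, the attribute names (`dept`, `region`, `manager`), the helper functions, and the toy cost (number of cells overwritten with the most frequent value in each group) are all assumptions introduced here.

```python
from collections import Counter, defaultdict

def fd_groups(rows, lhs, rhs):
    """Group the RHS values of the rows by their LHS values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row[rhs])
    return groups

def violates(rows, lhs, rhs):
    """True if some LHS group maps to more than one RHS value."""
    return any(len(set(vals)) > 1 for vals in fd_groups(rows, lhs, rhs).values())

def naive_data_repair_cost(rows, lhs, rhs):
    """Toy cost (assumption): cells to overwrite so that every LHS group
    agrees on its most frequent RHS value."""
    cost = 0
    for vals in fd_groups(rows, lhs, rhs).values():
        cost += len(vals) - Counter(vals).most_common(1)[0][1]
    return cost

def candidate_constraint_repairs(rows, lhs, rhs):
    """Single attributes whose addition to the LHS makes the FD hold on the data."""
    attrs = [a for a in rows[0] if a not in lhs and a != rhs]
    return [a for a in attrs if not violates(rows, lhs + [a], rhs)]

# Hypothetical table: the FD dept -> manager is violated by the Sales rows.
rows = [
    {"dept": "Sales", "region": "East", "manager": "Ann"},
    {"dept": "Sales", "region": "East", "manager": "Ann"},
    {"dept": "Sales", "region": "West", "manager": "Bob"},
    {"dept": "Sales", "region": "West", "manager": "Bob"},
    {"dept": "HR", "region": "East", "manager": "Cy"},
]

print(violates(rows, ["dept"], "manager"))
print(naive_data_repair_cost(rows, ["dept"], "manager"))
print(candidate_constraint_repairs(rows, ["dept"], "manager"))
```

In this toy table, a data repair would overwrite two `manager` cells, whereas a constraint repair would relax the FD to `dept, region -> manager` and leave the data untouched. Deciding between the two requires a cost that covers both kinds of change, which is the role the abstract assigns to the unified cost model; the sketch does not attempt to reproduce it.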