An incremental clustering scheme for duplicate detection in large databases

9th International Database Engineering & Application Symposium (IDEAS'05); Pub Date: 2005-07-25; DOI: 10.1109/IDEAS.2005.10

Eugenio Cesario, Francesco Folino, G. Manco, L. Pontieri

Citations: 12

Abstract

We propose an incremental algorithm for clustering duplicate tuples in large databases. It assigns each new tuple t to the cluster containing the database tuples most similar to t, which are therefore likely to refer to the same real-world entity as t. The core of the approach is a hash-based indexing technique that tends to assign highly similar objects to the same buckets. Empirical evaluation shows that the proposed method yields a considerable efficiency improvement over a state-of-the-art index structure for proximity searches in metric spaces.
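The abstract does not say which hash family the index uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes tuples are compared as sets of character trigrams under Jaccard similarity, and it substitutes a standard MinHash signature with banding for the paper's indexing technique. All names (`qgrams`, `minhash_signature`, `IncrementalDuplicateIndex`) and parameters (`num_hashes`, `band_size`, `threshold`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a case/whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(grams, num_hashes=16):
    """MinHash signature: for each seed, the smallest seeded hash over all q-grams."""
    return tuple(
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode("utf-8")).digest(), "big")
            for g in grams
        )
        for seed in range(num_hashes)
    )


class IncrementalDuplicateIndex:
    """Hypothetical sketch: assigns each new tuple to the cluster of its most
    similar already-indexed tuple, looking only inside matching hash buckets."""

    def __init__(self, num_hashes=16, band_size=2, threshold=0.5):
        self.num_hashes = num_hashes
        self.band_size = band_size
        self.threshold = threshold           # assumed minimum similarity for a duplicate
        self.buckets = defaultdict(set)      # (band start, band values) -> tuple ids
        self.grams = []                      # tuple id -> q-gram set
        self.cluster_of = []                 # tuple id -> cluster id

    def _keys(self, grams):
        sig = minhash_signature(grams, self.num_hashes)
        return [(i, sig[i:i + self.band_size])
                for i in range(0, self.num_hashes, self.band_size)]

    def add(self, text):
        """Index one new tuple (given as a string) and return its cluster id."""
        grams = qgrams(text)
        keys = self._keys(grams)
        # Candidates are the tuples sharing at least one bucket with the new one.
        candidates = set()
        for key in keys:
            candidates |= self.buckets.get(key, set())
        best_id, best_sim = None, 0.0
        for cid in candidates:
            sim = jaccard(grams, self.grams[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim
        new_id = len(self.grams)
        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the nearest neighbour's cluster
        else:
            cluster = new_id                     # start a new cluster
        self.grams.append(grams)
        self.cluster_of.append(cluster)
        for key in keys:
            self.buckets[key].add(new_id)
        return cluster


if __name__ == "__main__":
    index = IncrementalDuplicateIndex()
    for record in ["John Smith, 12 Main Street",
                   "Jon Smith, 12 Main Street",
                   "Maria Rossi, Via Roma 5, Cosenza"]:
        print(record, "-> cluster", index.add(record))
```

Each insertion compares the new tuple only against tuples that share at least one bucket with it, rather than against the whole database, which is where the efficiency gain of a hash-based index would come from. Because bucketing is probabilistic, a near-duplicate can occasionally be missed; more bands or shorter bands trade extra candidate comparisons for higher recall.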

