大规模实体解析的阻塞:挑战、算法和实际示例

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI:10.1109/ICDE.2016.7498364

G. Papadakis, Themis Palpanas

{"title":"大规模实体解析的阻塞:挑战、算法和实际示例","authors":"G. Papadakis, Themis Palpanas","doi":"10.1109/ICDE.2016.7498364","DOIUrl":null,"url":null,"abstract":"Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look on blocking-based Entity Resolution, starting from the early blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render inapplicable these schema-based techniques. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several, state-of-the-art techniques. The participants of the tutorial will put in practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which includes the most prominent techniques in the area and can be readily used to tackle Entity Resolution problems.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"19 1","pages":"1436-1439"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Blocking for large-scale Entity Resolution: Challenges, algorithms, and practical examples\",\"authors\":\"G. Papadakis, Themis Palpanas\",\"doi\":\"10.1109/ICDE.2016.7498364\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look on blocking-based Entity Resolution, starting from the early blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render inapplicable these schema-based techniques. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several, state-of-the-art techniques. The participants of the tutorial will put in practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which includes the most prominent techniques in the area and can be readily used to tackle Entity Resolution problems.\",\"PeriodicalId\":6883,\"journal\":{\"name\":\"2016 IEEE 32nd International Conference on Data Engineering (ICDE)\",\"volume\":\"19 1\",\"pages\":\"1436-1439\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE 32nd International Conference on Data Engineering (ICDE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2016.7498364\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2016.7498364","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

实体解析是重叠信息源集成的基础任务之一。由于它的二次复杂度，大量的研究集中在提高它的效率，使其扩展到Web数据集合，本质上是庞大的和高度异构的。用于此目的的最常见方法是块，它将相似的实体聚集到块中，以便将成对比较限制在每个块中包含的实体中。在本教程中，我们将仔细研究基于块的实体解析，从早期为数据库集成而设计的块方法开始。我们强调了当代异构、嘈杂、海量的Web数据所带来的挑战，并解释了为什么它们使这些基于模式的技术变得不适用。我们继续介绍阻塞方法，这些方法是为大规模和异构信息开发的，适用于Web数据收集。我们还解释了如何通过元阻塞和并行化技术进一步提高它们的效率。我们以演示几种最先进技术的相对性能的实践会话结束。本教程的参与者将把理论部分讨论的所有主题付诸实践，并将熟悉参考工具箱，其中包括该领域最突出的技术，可以很容易地用于解决实体解决问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Blocking for large-scale Entity Resolution: Challenges, algorithms, and practical examples

Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block. In this tutorial, we take a close look on blocking-based Entity Resolution, starting from the early blocking methods that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render inapplicable these schema-based techniques. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several, state-of-the-art techniques. The participants of the tutorial will put in practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which includes the most prominent techniques in the area and can be readily used to tackle Entity Resolution problems.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

自引率

0.00%

发文量

期刊最新文献

Data profiling SEED: A system for entity exploration and debugging in large-scale knowledge graphs TemProRA: Top-k temporal-probabilistic results analysis Durable graph pattern queries on historical graphs SCouT: Scalable coupled matrix-tensor factorization - algorithm and discoveries