一种基于尝试的编辑距离约束近似实体提取方法

2012 IEEE 28th International Conference on Data Engineering Pub Date : 2012-04-01 DOI:10.1109/ICDE.2012.29

Dong Deng, Guoliang Li, Jianhua Feng

{"title":"一种基于尝试的编辑距离约束近似实体提取方法","authors":"Dong Deng, Guoliang Li, Jianhua Feng","doi":"10.1109/ICDE.2012.29","DOIUrl":null,"url":null,"abstract":"Dictionary-based entity extraction has attracted much attention from the database community recently, which locates sub strings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between sub strings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. First, they need to tune many parameters to achieve high performance. Second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index segments. To extract similar entities, we search segments from the document, and extend the matching segments in both entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":"{\"title\":\"An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints\",\"authors\":\"Dong Deng, Guoliang Li, Jianhua Feng\",\"doi\":\"10.1109/ICDE.2012.29\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Dictionary-based entity extraction has attracted much attention from the database community recently, which locates sub strings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between sub strings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. First, they need to tune many parameters to achieve high performance. Second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index segments. To extract similar entities, we search segments from the document, and extend the matching segments in both entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.\",\"PeriodicalId\":321608,\"journal\":{\"name\":\"2012 IEEE 28th International Conference on Data Engineering\",\"volume\":\"22 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"30\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 28th International Conference on Data Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2012.29\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 28th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2012.29","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 30

摘要

基于字典的实体提取最近引起了数据库界的广泛关注，它将文档中的子字符串定位到预定义的实体中(例如，人名或位置)。为了提高提取召回率，最近的一个趋势是通过容忍小错误在文档的子字符串和实体之间提供近似匹配。本文研究了具有编辑距离约束的基于字典的近似实体抽取。现有的方法有一些局限性。首先，它们需要调优许多参数以实现高性能。其次，对于较大的编辑距离阈值，它们是低效的。我们提出了一种基于尝试的方法来解决这些问题。我们首先将每个实体划分为一组段，然后使用trie结构对段进行索引。为了提取相似的实体，我们从文档中搜索片段，并扩展实体和文档中的匹配片段以找到相似的对。我们开发了一种基于扩展的方法，通过扩展匹配段来有效地找到相似的字符串对。我们对分区方案进行了优化，选择了最佳的分区策略来提高提取性能。实验结果表明，与目前的研究相比，我们的方法取得了更高的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints

Dictionary-based entity extraction has attracted much attention from the database community recently, which locates sub strings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between sub strings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. First, they need to tune many parameters to achieve high performance. Second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index segments. To extract similar entities, we search segments from the document, and extend the matching segments in both entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE 28th International Conference on Data Engineering

自引率

0.00%

发文量

期刊最新文献

Keyword Query Reformulation on Structured Data Accuracy-Aware Uncertain Stream Databases Extracting Analyzing and Visualizing Triangle K-Core Motifs within Networks Project Daytona: Data Analytics as a Cloud Service Automatic Extraction of Structured Web Data with Domain Knowledge