Optimizing Statistical Information Extraction Programs over Evolving Text

2012 IEEE 28th International Conference on Data Engineering Pub Date : 2012-04-01 DOI:10.1109/ICDE.2012.60

Fei Chen, Xixuan Feng, C. Ré, Min Wang


Citations: 23

Abstract

Statistical information extraction (IE) programs are increasingly used to build real-world IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical IE approaches consider the text corpora underlying the extraction program to be static. However, many real-world text corpora are dynamic (documents are inserted, modified, and removed). As the corpus evolves, IE programs must be applied repeatedly to consecutive corpus snapshots to keep the extracted information up to date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive snapshots may differ very little, but, unaware of this, the program must run again from scratch. In this paper, we present CRFlex, a system that efficiently executes such repeated statistical IE by recycling previous IE results to enable incremental update. As a first step, CRFlex focuses on statistical IE programs that use a leading statistical model, Conditional Random Fields (CRFs). We show how to model properties of the CRF inference algorithms for incremental update and how to exploit them to correctly recycle previous inference results. Then we show how to efficiently capture and store intermediate results of IE programs for subsequent recycling. We find that there is a tradeoff between the I/O cost spent on reading and writing intermediate results and the CPU cost we can save by recycling those intermediate results. Therefore, we present a cost-based solution to determine the most efficient recycling approach for any given CRF-based IE program and evolving corpus. We conduct extensive experiments with CRF-based IE programs for three IE tasks over a real-world data set to demonstrate the utility of our approach.
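
To make the idea of recycling inference results concrete, the following is a minimal sketch and not the CRFlex implementation, whose details the abstract does not give. It assumes a linear-chain model decoded with Viterbi, and it assumes that the score of a label at position t depends only on the token at position t (real CRF feature functions may look at a wider window, which would shrink the reusable region). Under those assumptions, the forward Viterbi rows for the unchanged prefix of a document are identical across snapshots and can be reused. All names here (`viterbi_forward`, `decode`, `first_diff`, `recycling_pays_off`) are hypothetical, and the final function is only a toy stand-in for the I/O-versus-CPU tradeoff the abstract mentions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# One Viterbi row: (best score per label, backpointer per label).
Row = Tuple[List[float], List[int]]
# Hypothetical emission scorer: (token, label index) -> score.
Emit = Callable[[str, int], float]


def viterbi_forward(tokens: Sequence[str], n_labels: int, emit: Emit,
                    trans: Sequence[Sequence[float]],
                    cached: Optional[List[Row]] = None,
                    start: int = 0) -> List[Row]:
    """Compute Viterbi rows; rows before `start` are copied from `cached`."""
    # Without a cache nothing can be reused, so recompute from position 0.
    start = min(start, len(cached)) if cached else 0
    rows: List[Row] = list(cached[:start]) if cached else []
    for t in range(start, len(tokens)):
        if t == 0:
            scores = [emit(tokens[0], y) for y in range(n_labels)]
            back = [-1] * n_labels
        else:
            prev = rows[t - 1][0]
            scores, back = [], []
            for y in range(n_labels):
                best = max(range(n_labels),
                           key=lambda p: prev[p] + trans[p][y])
                scores.append(prev[best] + trans[best][y] + emit(tokens[t], y))
                back.append(best)
        rows.append((scores, back))
    return rows


def decode(rows: List[Row]) -> List[int]:
    """Follow backpointers from the best final label to recover the path."""
    if not rows:
        return []
    final = rows[-1][0]
    y = max(range(len(final)), key=lambda i: final[i])
    path = [y]
    for t in range(len(rows) - 1, 0, -1):
        y = rows[t][1][y]
        path.append(y)
    return path[::-1]


def first_diff(old: Sequence[str], new: Sequence[str]) -> int:
    """Index of the first token where two snapshots of a document differ."""
    n = min(len(old), len(new))
    for i in range(n):
        if old[i] != new[i]:
            return i
    return n


def recycling_pays_off(io_cost: float, cpu_saved: float) -> bool:
    """Toy cost rule (an assumption): recycle only if the I/O spent on
    reading and writing cached rows is below the CPU time it saves."""
    return io_cost < cpu_saved
```

In this sketch, a caller would keep the rows from the previous snapshot, find the first changed token with `first_diff`, and pass both to `viterbi_forward` so that only the rows from that position onward are recomputed. Whether doing so is worthwhile depends on the estimated costs, which is the kind of decision the cost-based approach in the abstract addresses.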


