文本数据提取算子的选择性估计

D. Wang, Long Wei, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan
{"title":"文本数据提取算子的选择性估计","authors":"D. Wang, Long Wei, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan","doi":"10.1109/ICDE.2011.5767931","DOIUrl":null,"url":null,"abstract":"Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt the language models in the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: stratified bloom filter synopsis and roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that, the top-k stratified bloom filter synopsis and the roll-up synopsis is the most accurate in dictionary and regular expression selectivity estimation respectively.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Selectivity estimation for extraction operators over text data\",\"authors\":\"D. Wang, Long Wei, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan\",\"doi\":\"10.1109/ICDE.2011.5767931\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt the language models in the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: stratified bloom filter synopsis and roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that, the top-k stratified bloom filter synopsis and the roll-up synopsis is the most accurate in dictionary and regular expression selectivity estimation respectively.\",\"PeriodicalId\":332374,\"journal\":{\"name\":\"2011 IEEE 27th International Conference on Data Engineering\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE 27th International Conference on Data Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2011.5767931\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 27th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2011.5767931","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

摘要

最近,人们对扩展关系查询处理越来越感兴趣,以便有效地支持文本数据上的提取操作符,如字典和正则表达式。许多文本处理查询非常复杂,因为它们涉及多个提取和连接操作符,从而产生许多可能的查询计划。然而,很少有研究为这些提取算子构建选择性或成本估计,这对于优化器选择一个好的查询计划至关重要。在本文中,我们定义了字典和正则表达式的选择性估计问题,并提出在文本语料库上开发文档概要,从中可以估计选择性。我们首先采用自然语言处理文献中的语言模型,形成top-k n-gram摘要作为基准文档摘要。然后,我们开发了两类新颖的文档概要:分层布隆过滤器概要和卷取概要。我们还开发了将复杂正则表达式分解为子部分的技术,以实现更有效和准确的估计。我们在安然电子邮件语料库上进行实验,使用真实世界和合成工作负载来比较不同类别和概要变化的选择性估计的准确性。结果表明,top-k分层布隆过滤器概要和卷取概要分别在字典和正则表达式选择性估计中最准确。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Selectivity estimation for extraction operators over text data
Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt the language models in the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: stratified bloom filter synopsis and roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that, the top-k stratified bloom filter synopsis and the roll-up synopsis is the most accurate in dictionary and regular expression selectivity estimation respectively.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Advanced search, visualization and tagging of sensor metadata Bidirectional mining of non-redundant recurrent rules from a sequence database Web-scale information extraction with vertex Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins Dynamic prioritization of database queries
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1