Latest publications from the 2011 IEEE 27th International Conference on Data Engineering

Jackpine: A benchmark to evaluate spatial database performance
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767929
S. Ray, Bogdan Simion, Angela Demke Brown
The volume of spatial data generated and consumed is rising exponentially, and new applications are emerging as the costs of storage, processing power and network bandwidth continue to decline. Database support for spatial operations is fast becoming a necessity rather than a niche feature provided by a few products. However, the spatial functionality offered by current commercial and open-source relational databases differs significantly in terms of available features, true geodetic support, spatial functions and indexing. Benchmarks play a crucial role in evaluating the functionality and performance of a particular database, both for application users and developers, and for the database developers themselves. In contrast to transaction processing, however, there is no standard, widely used benchmark for spatial database operations. In this paper, we present a spatial database benchmark called Jackpine. Our benchmark is portable (it can support any database with a JDBC driver implementation) and includes both micro benchmarks and macro workload scenarios. The micro benchmark component tests basic spatial operations in isolation; it consists of queries based on the Dimensionally Extended 9-intersection model of topological relations and queries based on spatial analysis functions. Each macro workload includes a series of queries that are based on a common spatial data application. These macro scenarios include map search and browsing, geocoding, reverse geocoding, flood risk analysis, land information management and toxic spill analysis. We use Jackpine to evaluate the spatial features in two open-source databases and one commercial offering.
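As an illustration of how a JDBC-only benchmark can probe a spatial database, here is a minimal sketch of a micro-benchmark-style probe for one DE-9IM topological relation. The connection URL, the table and column names (rivers, roads, geom), and the ST_Crosses function (as exposed by PostGIS-style spatial extensions) are assumptions for the example, not details taken from the paper:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Times one DE-9IM topological query through plain JDBC, which is the only
// interface a Jackpine-style benchmark assumes of the target database.
public class SpatialMicroBenchmark {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/spatialdb", "user", "pass");
             Statement stmt = conn.createStatement()) {
            // Hypothetical tables; ST_Crosses implements one DE-9IM relation.
            String sql = "SELECT COUNT(*) FROM rivers r, roads d "
                       + "WHERE ST_Crosses(r.geom, d.geom)";
            long start = System.nanoTime();
            try (ResultSet rs = stmt.executeQuery(sql)) {
                rs.next();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("rows=%d elapsed=%dms%n", rs.getLong(1), elapsedMs);
            }
        }
    }
}
```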
Citations: 83
SmartTrace: Finding similar trajectories in smartphone networks without disclosing the traces
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767934
Constantinos Costa, C. Laoudias, D. Zeinalipour-Yazti, D. Gunopulos
In this demonstration paper, we present a powerful distributed framework for finding similar trajectories in a smartphone network, without disclosing the traces of participating users. Our framework exploits opportunistic and participatory sensing in order to quickly answer queries of the form: “Report objects (i.e., trajectories) that follow a spatio-temporal motion similar to Q, where Q is some query trajectory.” SmartTrace relies on an in-situ data storage model, where geo-location data is recorded locally on smartphones for both performance and privacy reasons. SmartTrace then deploys an efficient top-K query processing algorithm that exploits distributed trajectory similarity measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to Q quickly and efficiently. Our demonstration shows how the SmartTrace algorithms are ported to a network of Android-based smartphone devices with impressive query response times. To demonstrate the capabilities of SmartTrace during the conference, we will allow attendees to query local smartphone networks in the following two modes: i) Interactive Mode, where devices will be handed out to participants, aiming to identify who is moving similarly to the querying node; and ii) Trace-driven Mode, where a large-scale deployment can be launched in order to show how the K most similar trajectories can be identified quickly and efficiently. Conference attendees will be able to appreciate how interesting spatio-temporal search applications can be implemented efficiently (for performance reasons) and without disclosing complete user traces to the query processor (for privacy reasons). For instance, an attendee might be able to determine which other attendees have participated in common sessions, in order to initiate new discussions and collaborations, without knowing their trajectories or revealing his/her own.
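The abstract does not spell out its similarity measures; as a hedged sketch of one plausible noise-tolerant trajectory similarity, the following implements a simple LCSS-style dynamic program. The matching tolerance eps and the point representation are assumed for the example:

```java
// Hedged sketch of a noise-tolerant trajectory similarity (LCSS-style);
// SmartTrace's actual distributed measures may differ.
public class TrajectorySimilarity {
    // A point is (x, y); two points "match" if both coordinates are within eps.
    static boolean match(double[] p, double[] q, double eps) {
        return Math.abs(p[0] - q[0]) <= eps && Math.abs(p[1] - q[1]) <= eps;
    }

    // Longest common subsequence of two trajectories under spatial tolerance eps.
    static int lcss(double[][] a, double[][] b, double eps) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++)
            for (int j = 1; j <= b.length; j++)
                dp[i][j] = match(a[i - 1], b[j - 1], eps)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return dp[a.length][b.length];
    }

    public static void main(String[] args) {
        double[][] q = {{0, 0}, {1, 1}, {2, 2}};
        double[][] t = {{0.1, 0}, {1, 0.9}, {2.1, 2}};
        // Normalized similarity in [0, 1]; candidates would be ranked by this score.
        System.out.println(lcss(q, t, 0.2) / (double) Math.min(q.length, t.length));
    }
}
```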
Citations: 39
SMM: A data stream management system for knowledge discovery
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767879
Hetal Thakkar, N. Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, C. Zaniolo
The problem of supporting data mining applications proved to be difficult for database management systems, and it is now proving to be very challenging for data stream management systems (DSMSs), where the limitations of SQL are made even more severe by the requirements of continuous queries. The major technical advances achieved separately on DSMSs and on data stream mining algorithms have failed to converge and produce powerful data stream mining systems. Such systems, however, are essential, since the traditional pull-based approach of cache mining is no longer applicable, and the push-based computing mode of data streams and their bursty traffic complicate application development. For instance, to write mining applications with quality of service (QoS) levels approaching those of DSMSs, a mining analyst would have to contend with many arduous tasks, such as support for data buffering, complex storage and retrieval methods, scheduling, fault tolerance, synopsis management, load shedding, and query optimization. Our Stream Mill Miner (SMM) system solves these problems by providing a data stream mining workbench that combines the ease of specifying high-level mining tasks, as in Weka, with the performance and QoS guarantees of a DSMS. This is accomplished in three main steps. The first is an open and extensible DSMS architecture where KDD queries can be easily expressed as user-defined aggregates (UDAs); our system combines that with the efficiency of synoptic data structures and mining-aware load shedding and optimizations. The second key component of SMM is its integrated library of fast mining algorithms that are light enough to be effective on data streams. The third advanced feature of SMM is a Mining Model Definition Language (MMDL) that allows users to define the flow of mining tasks, integrated with a simple box-and-arrow GUI, to shield the mining analyst from the complexities of lower-level queries. SMM is the first DSMS capable of online mining, and this paper describes its architecture, design, and performance on mining queries.
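The idea of expressing KDD queries as user-defined aggregates can be pictured with the usual initialize/iterate/terminate contract of stream UDAs. The interface and class names below are hypothetical Java stand-ins; SMM's actual UDA facility lives in Stream Mill's SQL dialect and is not reproduced here:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical initialize/iterate/terminate contract that stream UDAs follow.
interface StreamUda<I, O> {
    void initialize();        // called once, before the first tuple
    void iterate(I tuple);    // called per arriving stream tuple
    O terminate();            // called at window close to emit a result
}

// Example: frequency counting, a core primitive of many stream-mining tasks.
class FrequencyUda implements StreamUda<String, Map<String, Integer>> {
    private Map<String, Integer> counts;

    public void initialize() { counts = new HashMap<>(); }

    public void iterate(String item) { counts.merge(item, 1, Integer::sum); }

    public Map<String, Integer> terminate() { return counts; }
}

public class UdaDemo {
    public static void main(String[] args) {
        FrequencyUda uda = new FrequencyUda();
        uda.initialize();
        for (String s : new String[]{"a", "b", "a"}) uda.iterate(s);
        System.out.println(uda.terminate()); // e.g., {a=2, b=1}
    }
}
```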
Citations: 25
PrefJoin: An efficient preference-aware join operator
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767894
Mohamed E. Khalefa, M. Mokbel, Justin J. Levandoski
Preference queries are essential to a wide spectrum of applications, including multi-criteria decision-making tools and personalized databases. Unfortunately, most evaluation techniques for preference queries assume that the set of preferred attributes is stored in a single relation, passing over a wide set of queries that involve preference computations over multiple relations. This paper presents PrefJoin, an efficient preference-aware join query operator designed specifically to handle preference queries over multiple relations. PrefJoin consists of four main phases: Local Pruning, Data Preparation, Joining, and Refining. These phases, respectively, filter out from each input relation those tuples that are guaranteed not to be in the final preference set; associate metadata with each surviving tuple that is used to optimize the execution of the subsequent phases; produce the subset of the join result that is relevant to the given preference function; and refine those tuples. An interesting characteristic of PrefJoin is that it tightly integrates preference computation with the join; hence tuples that are guaranteed not to be answers can be pruned early, saving significant unnecessary computation cost. PrefJoin supports a variety of preference functions, including skyline, multi-objective and k-dominance preference queries. We show the correctness of PrefJoin. Experimental evaluation based on a real system implementation inside PostgreSQL shows that PrefJoin consistently achieves one to three orders of magnitude performance gain over its competitors in various scenarios.
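To make the early-pruning idea concrete, here is a minimal sketch of the dominance test that underlies skyline preferences, the kind of check a Local Pruning phase could apply to discard tuples before the join; it is an illustration, not PrefJoin's actual implementation:

```java
// Skyline dominance test, assuming smaller values are preferred in every dimension.
public class Dominance {
    // p dominates q if p is no worse in every dimension and strictly better in one.
    static boolean dominates(double[] p, double[] q) {
        boolean strictlyBetter = false;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > q[i]) return false;
            if (p[i] < q[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    public static void main(String[] args) {
        System.out.println(dominates(new double[]{1, 2}, new double[]{2, 3})); // true
        System.out.println(dominates(new double[]{1, 4}, new double[]{2, 3})); // false
    }
}
```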
Citations: 24
Selectivity estimation for extraction operators over text data
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767931
D. Wang, Long Wei, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan
Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to build document synopses over a text corpus from which the selectivity can be estimated. We first adapt the language models from the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: the stratified bloom filter synopsis and the roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus, using both real-world and synthetic workloads, to compare the accuracy of selectivity estimation across different classes and variations of synopses. The results show that the top-k stratified bloom filter synopsis and the roll-up synopsis are the most accurate for dictionary and regular expression selectivity estimation, respectively.
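A hedged sketch of the baseline top-k n-gram synopsis described above: count the corpus's n-grams, keep the k most frequent, and bound a dictionary entry's selectivity by the count of its rarest n-gram. Class and method names are illustrative; the paper's stratified bloom filter and roll-up synopses are more involved:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TopKNGramSynopsis {
    private final Map<String, Long> topK; // the k most frequent n-grams with counts
    private final int n;

    TopKNGramSynopsis(Iterable<String> corpus, int n, int k) {
        this.n = n;
        Map<String, Long> counts = new HashMap<>();
        for (String doc : corpus)
            for (int i = 0; i + n <= doc.length(); i++)
                counts.merge(doc.substring(i, i + n), 1L, Long::sum);
        topK = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(k)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    // Crude estimate: the frequency of the term's rarest n-gram upper-bounds
    // how many corpus occurrences the dictionary entry can have.
    long estimate(String term) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i + n <= term.length(); i++)
            best = Math.min(best, topK.getOrDefault(term.substring(i, i + n), 0L));
        return best == Long.MAX_VALUE ? 0 : best;
    }

    public static void main(String[] args) {
        TopKNGramSynopsis syn = new TopKNGramSynopsis(
                java.util.List.of("selectivity", "selection", "estimation"), 3, 50);
        System.out.println(syn.estimate("select")); // 2
    }
}
```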
Citations: 10
Answering approximate string queries on large data sets using external memory
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767856
Alexander Behm, Chen Li, M. Carey
An approximate string query finds, from a collection of strings, those that are similar to a given query string. Answering such queries is important in many applications, such as data cleaning and record linkage, where errors can occur in queries as well as in the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries, and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well, and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, the BED-tree.
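For background on the candidate-then-verify pattern that gram-based inverted indexes rely on, here is a sketch of the classic q-gram count filter plus a dynamic-programming verification step. This is standard technique, not the paper's specific disk-based algorithm:

```java
public class GramFilter {
    // Lower bound on the number of q-grams two strings must share if their
    // edit distance is at most k (lenS is the length of the indexed string):
    // candidates below this bound can be skipped before verification.
    static int commonGramLowerBound(int lenS, int q, int k) {
        return (lenS - q + 1) - k * q;
    }

    // Verification step: plain dynamic-programming edit distance.
    static int editDistance(String s, String t) {
        int[][] dp = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= t.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1));
        return dp[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(commonGramLowerBound("kaufman".length(), 2, 1)); // 4
        System.out.println(editDistance("kaufman", "kaufmann"));            // 1
    }
}
```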
Citations: 30
Updating XML schemas and associated documents through exup
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767951
Federico Cavalieri, G. Guerrini, M. Mesiti
Most data on the Web is in XML format, and the need often arises to update its structure, which is commonly described by an XML Schema. When a schema is modified, the effects of the modification on associated documents must be addressed. XSUpdate is a language that makes it easy to identify parts of an XML Schema, apply a modification primitive to them, and finally define an adaptation for associated documents; Eχup is the corresponding engine for processing schema modification and document adaptation statements. The purpose of this demonstration is to provide an overview of the facilities of the XSUpdate language and of the Eχup system.
Citations: 20
Interval-based pruning for top-k processing over compressed lists
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767855
K. Chakrabarti, S. Chaudhuri, Venkatesh Ganti
Optimizing the execution of top-k queries over record-id-ordered, compressed lists is challenging. The threshold family of algorithms cannot be used effectively in such cases. Yet improving the execution of such queries is of great value. For example, top-k keyword search in information retrieval (IR) engines represents an important scenario where such optimization can be directly beneficial. In this paper, we develop novel algorithms that improve the execution of such queries over state-of-the-art techniques. Our main insights are pruning based on fine-granularity bounds and traversing the lists based on judiciously chosen “intervals” rather than individual records. We formally study the optimality characteristics of the proposed algorithms. Our algorithms require minimal changes and can be easily integrated into IR engines. Our experiments on real-life datasets show that our algorithms outperform state-of-the-art techniques by a factor of 3–6 in terms of query execution time.
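A minimal sketch of the interval idea: each list is divided into intervals carrying a precomputed score upper bound, and an interval is decoded only if that bound can beat the current k-th best score. The data layout and scores below are invented for illustration and omit the paper's compression details:

```java
import java.util.PriorityQueue;

public class IntervalPruning {
    static class Interval {
        double maxScore;   // fine-granularity upper bound for this interval
        double[] scores;   // decoded per-record scores (stands in for the records)
        Interval(double maxScore, double... scores) {
            this.maxScore = maxScore; this.scores = scores;
        }
    }

    static PriorityQueue<Double> topK(Interval[] list, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap of top-k scores
        for (Interval iv : list) {
            double threshold = heap.size() < k ? Double.NEGATIVE_INFINITY : heap.peek();
            if (iv.maxScore <= threshold) continue; // prune: cannot enter the top-k
            for (double s : iv.scores) {            // decode only surviving intervals
                if (heap.size() < k) heap.add(s);
                else if (s > heap.peek()) { heap.poll(); heap.add(s); }
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        Interval[] list = {
            new Interval(9.0, 9.0, 4.1), new Interval(3.0, 3.0, 2.5),
            new Interval(8.5, 8.5, 7.2)
        };
        System.out.println(topK(list, 2)); // [8.5, 9.0]; the middle interval is pruned
    }
}
```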
Citations: 61
Ontological queries: Rewriting and optimization
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767965
G. Gottlob, G. Orsi, Andreas Pieris
Ontological queries are evaluated against an enterprise ontology rather than directly on a database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent query against the underlying relational database. The focus here is on soundness and completeness. We review previous results and present a new rewriting algorithm for rather general types of ontological constraints (description logics). In particular, we show how a conjunctive query (CQ) against an enterprise ontology can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this process so as to produce a small and cost-effective output UCQ. We review existing optimization methods and propose an effective new method that works for Linear Datalog±, a language that encompasses well-known description logics of the DL-Lite family.
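As a hedged illustration of one rewriting step (a textbook Linear Datalog±-style example, not taken from the paper): a rule whose head unifies with a query atom allows that atom to be replaced by the rule's body, and the union of the original and rewritten queries is then evaluated directly on the database.

```latex
% Rule (a linear TGD): every person has a father.
\forall X \, \big( \mathrm{person}(X) \rightarrow \exists Y \, \mathrm{hasFather}(X,Y) \big)
% Conjunctive query: who has a father?
q(X) \leftarrow \mathrm{hasFather}(X,Y)
% hasFather(X,Y) unifies with the rule head, so the rewriting yields the UCQ
q(X) \leftarrow \mathrm{hasFather}(X,Y) \quad\cup\quad q(X) \leftarrow \mathrm{person}(X)
```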
Citations: 164
Program transformations for asynchronous query submission
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767870
Mahendra Chavan, Ravindra Guravannavar, Karthik Ramachandra, Sundararajarao Sudarshan
Synchronous execution of queries or Web service requests forces the calling application to block until the query/request is satisfied. The performance of applications can be significantly improved by asynchronous submission of queries, which allows the application to perform other processing instead of blocking while the query executes, and to issue multiple queries concurrently. Concurrent submission of multiple queries can allow the query execution engine to better utilize multiple processors and disks, and to reorder disk IO requests to minimize seeks. Concurrent submission also reduces the impact of network round-trip latency and delays at the database when processing multiple queries. However, manually writing applications to exploit asynchronous query submission is tedious. In this paper we address the issue of automatically transforming a program written assuming synchronous query submission into one that exploits asynchronous query submission. Our program transformation method is based on dataflow analysis and is framed as a set of transformation rules. Our rules can handle query executions within loops, unlike some of the earlier work in this area. We have built a tool that implements our transformation techniques on Java code that uses JDBC calls; our tool can be extended to handle Web service calls. We have carried out a detailed experimental study of several real-life applications rewritten using our transformation techniques. The experimental study shows the effectiveness of the proposed rewrite techniques, both in terms of their applicability and the performance gains achieved.
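The transformation target can be pictured with plain JDBC and java.util.concurrent: a blocking query becomes a task submitted ahead of its first use, so application work overlaps with query execution. This sketch shows only the rewritten form; the connection string, query, and helper method are illustrative, and the paper's tool derives such code automatically via dataflow analysis:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncSubmission {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Submit the query early, at the point where its inputs are available.
        Future<Integer> pending = pool.submit(() -> {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/db", "user", "pass");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
                rs.next();
                return rs.getInt(1);
            }
        });

        doOtherProcessing();                // overlaps with query execution
        System.out.println(pending.get());  // blocks only if the query is unfinished
        pool.shutdown();
    }

    static void doOtherProcessing() { /* application work independent of the query */ }
}
```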
Citations: 27