Jackpine: A benchmark to evaluate spatial database performance
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767929
S. Ray, Bogdan Simion, Angela Demke Brown
The volume of spatial data generated and consumed is rising exponentially, and new applications are emerging as the costs of storage, processing power and network bandwidth continue to decline. Database support for spatial operations is fast becoming a necessity rather than a niche feature provided by a few products. However, the spatial functionality offered by current commercial and open-source relational databases differs significantly in terms of available features, true geodetic support, spatial functions and indexing. Benchmarks play a crucial role in evaluating the functionality and performance of a particular database, both for application users and developers, and for the database developers themselves. In contrast to transaction processing, however, there is no standard, widely used benchmark for spatial database operations. In this paper, we present a spatial database benchmark called Jackpine. Our benchmark is portable (it can support any database with a JDBC driver implementation) and includes both micro benchmarks and macro workload scenarios. The micro benchmark component tests basic spatial operations in isolation; it consists of queries based on the Dimensionally Extended 9-Intersection Model (DE-9IM) of topological relations and queries based on spatial analysis functions. Each macro workload includes a series of queries based on a common spatial data application. These macro scenarios include map search and browsing, geocoding, reverse geocoding, flood risk analysis, land information management and toxic spill analysis. We use Jackpine to evaluate the spatial features of two open-source databases and one commercial offering.
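As a concrete illustration of the kind of micro benchmark such a JDBC-portable suite can run, the sketch below times a single DE-9IM predicate in isolation. The connection URL, table names and the PostGIS-style ST_Intersects function are our assumptions for illustration, not Jackpine's actual schema or workload.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of a DE-9IM micro-benchmark step in the spirit of Jackpine's
// JDBC portability; schema and function names are illustrative assumptions.
public class SpatialMicroBenchmark {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/spatialdb", "user", "password");
             Statement stmt = conn.createStatement()) {

            long start = System.nanoTime();
            // One topological relation ("intersects") tested in isolation.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT COUNT(*) FROM counties c, rivers r " +
                    "WHERE ST_Intersects(c.geom, r.geom)")) {
                rs.next();
                long count = rs.getLong(1);
                double ms = (System.nanoTime() - start) / 1e6;
                System.out.printf("intersects: %d pairs in %.1f ms%n", count, ms);
            }
        }
    }
}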
{"title":"Jackpine: A benchmark to evaluate spatial database performance","authors":"S. Ray, Bogdan Simion, Angela Demke Brown","doi":"10.1109/ICDE.2011.5767929","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767929","url":null,"abstract":"The volume of spatial data generated and consumed is rising exponentially and new applications are emerging as the costs of storage, processing power and network bandwidth continue to decline. Database support for spatial operations is fast becoming a necessity rather than a niche feature provided by a few products. However, the spatial functionality offered by current commercial and open-source relational databases differs significantly in terms of available features, true geodetic support, spatial functions and indexing. Benchmarks play a crucial role in evaluating the functionality and performance of a particular database, both for application users and developers, and for the database developers themselves. In contrast to transaction processing, however, there is no standard, widely used benchmark for spatial database operations. In this paper, we present a spatial database benchmark called Jackpine. Our benchmark is portable (it can support any database with a JDBC driver implementation) and includes both micro benchmarks and macro workload scenarios. The micro benchmark component tests basic spatial operations in isolation; it consists of queries based on the Dimensionally Extended 9-intersection model of topological relations and queries based on spatial analysis functions. Each macro workload includes a series of queries that are based on a common spatial data application. These macro scenarios include map search and browsing, geocoding, reverse geocoding, flood risk analysis, land information management and toxic spill analysis. We use Jackpine to evaluate the spatial features in 2 open source databases and 1 commercial offering.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116213358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SmartTrace: Finding similar trajectories in smartphone networks without disclosing the traces
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767934
Constantinos Costa, C. Laoudias, D. Zeinalipour-Yazti, D. Gunopulos
In this demonstration paper, we present a powerful distributed framework for finding similar trajectories in a smartphone network, without disclosing the traces of participating users. Our framework exploits opportunistic and participatory sensing in order to quickly answer queries of the form: “Report objects (i.e., trajectories) that follow a similar spatio-temporal motion to Q, where Q is some query trajectory.” SmartTrace relies on an in-situ data storage model, where geo-location data is recorded locally on smartphones for both performance and privacy reasons. SmartTrace then deploys an efficient top-K query processing algorithm that exploits distributed trajectory similarity measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to Q quickly and efficiently. Our demonstration shows how the SmartTrace algorithmics are ported to a network of Android-based smartphone devices with impressive query response times. To demonstrate the capabilities of SmartTrace during the conference, we will allow attendees to query local smartphone networks in the following two modes: i) Interactive Mode, where devices will be handed out to participants aiming to identify who is moving similarly to the querying node; and ii) Trace-driven Mode, where a large-scale deployment can be launched in order to show how the K most similar trajectories can be identified quickly and efficiently. Conference attendees will be able to appreciate how interesting spatio-temporal search applications can be implemented efficiently (for performance reasons) and without disclosing the complete user traces to the query processor (for privacy reasons). For instance, an attendee might be able to determine which other attendees have participated in common sessions, in order to initiate new discussions and collaborations, without knowing their trajectories or revealing his or her own.
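The abstract does not name the similarity measure, so the sketch below illustrates one common noise-resilient choice: an LCSS-style count of trajectory points that match within spatial and temporal tolerances. This is our assumption for illustration, not SmartTrace's actual distributed algorithm.

// Illustrative noise-tolerant trajectory similarity; the LCSS-style measure
// and thresholds are assumptions, not the paper's actual matching scheme.
public final class TrajectorySimilarity {
    public record Point(double x, double y, long t) {}

    // Longest common subsequence of points that match within eps (space)
    // and tau (time); a higher count means more similar trajectories.
    public static int lcss(Point[] a, Point[] b, double eps, long tau) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                Point p = a[i - 1], q = b[j - 1];
                boolean match = Math.abs(p.x() - q.x()) <= eps
                        && Math.abs(p.y() - q.y()) <= eps
                        && Math.abs(p.t() - q.t()) <= tau;
                dp[i][j] = match ? dp[i - 1][j - 1] + 1
                                 : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length][b.length];
    }
}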
{"title":"SmartTrace: Finding similar trajectories in smartphone networks without disclosing the traces","authors":"Constantinos Costa, C. Laoudias, D. Zeinalipour-Yazti, D. Gunopulos","doi":"10.1109/ICDE.2011.5767934","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767934","url":null,"abstract":"In this demonstration paper, we present a powerful distributed framework for finding similar trajectories in a smartphone network, without disclosing the traces of participating users. Our framework, exploits opportunistic and participatory sensing in order to quickly answer queries of the form: “Report objects (i.e., trajectories) that follow a similar spatio-temporal motion to Q, where Q is some query trajectory.” SmartTrace, relies on an in-situ data storage model, where geo-location data is recorded locally on smartphones for both performance and privacy reasons. SmartTrace then deploys an efficient top-K query processing algorithm that exploits distributed trajectory similarity measures, resilient to spatial and temporal noise, in order to derive the most relevant answers to Q quickly and efficiently. Our demonstration shows how the SmartTrace algorithmics are ported on a network of Android-based smartphone devices with impressive query response times. To demonstrate the capabilities of SmartTrace during the conference, we will allow the attendees to query local smartphone networks in the following two modes: i) Interactive Mode, where devices will be handed out to participants aiming to identify who is moving similar to the querying node; and ii) Trace-driven Mode, where a large-scale deployment can be launched in order to show how the K most similar trajectories can be identified quickly and efficiently. The conference attendees will be able to appreciate how interesting spatio-temporal search applications can be implemented efficiently (for performance reasons) and without disclosing the complete user traces to the query processor (for privacy reasons)1. For instance, an attendee might be able to determine other attendees that have participated in common sessions, in order to initiate new discussions and collaborations, without knowing their trajectory or revealing his/her own trajectory either.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116607003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SMM: A data stream management system for knowledge discovery
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767879
Hetal Thakkar, N. Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, C. Zaniolo
The problem of supporting data mining applications proved difficult for database management systems, and it is now proving to be very challenging for data stream management systems (DSMSs), where the limitations of SQL are made even more severe by the requirements of continuous queries. The major technical advances achieved separately on DSMSs and on data stream mining algorithms have failed to converge and produce powerful data stream mining systems. Such systems, however, are essential, since the traditional pull-based approach of cache mining is no longer applicable, and the push-based computing mode of data streams and their bursty traffic complicate application development. For instance, to write mining applications with quality of service (QoS) levels approaching those of DSMSs, a mining analyst would have to contend with many arduous tasks, such as support for data buffering, complex storage and retrieval methods, scheduling, fault tolerance, synopsis management, load shedding, and query optimization. Our Stream Mill Miner (SMM) system solves these problems by providing a data stream mining workbench that combines the ease of specifying high-level mining tasks, as in Weka, with the performance and QoS guarantees of a DSMS. This is accomplished in three main steps. The first is an open and extensible DSMS architecture where KDD queries can be easily expressed as user-defined aggregates (UDAs); our system combines that with the efficiency of synoptic data structures and mining-aware load shedding and optimizations. The second key component of SMM is its integrated library of fast mining algorithms that are light enough to be effective on data streams. The third advanced feature of SMM is a Mining Model Definition Language (MMDL) that allows users to define the flow of mining tasks, integrated with a simple box-and-arrow GUI, to shield the mining analyst from the complexities of lower-level queries. SMM is the first DSMS capable of online mining, and this paper describes its architecture, design, and performance on mining queries.
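To make the UDA notion concrete, here is a minimal sketch of the initialize/iterate/terminate contract that streaming aggregates typically follow, with a trivial running mean as an example. The Java interface is our illustration only; SMM itself expresses UDAs in its own SQL dialect, not through this API.

// Sketch of the user-defined aggregate (UDA) pattern: a streaming aggregate
// exposes initialize/iterate/terminate steps that a DSMS drives per window.
public interface StreamUDA<I, O> {
    void initialize();          // reset state before a window starts
    void iterate(I tuple);      // consume one stream tuple, update state
    O terminate();              // emit the aggregate for the current window
}

// Example: incremental mean over a numeric stream.
class MeanUDA implements StreamUDA<Double, Double> {
    private double sum;
    private long count;
    public void initialize() { sum = 0; count = 0; }
    public void iterate(Double v) { sum += v; count++; }
    public Double terminate() { return count == 0 ? null : sum / count; }
}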
{"title":"SMM: A data stream management system for knowledge discovery","authors":"Hetal Thakkar, N. Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, C. Zaniolo","doi":"10.1109/ICDE.2011.5767879","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767879","url":null,"abstract":"The problem of supporting data mining applications proved to be difficult for database management systems and it is now proving to be very challenging for data stream management systems (DSMSs), where the limitations of SQL are made even more severe by the requirements of continuous queries. The major technical advances that achieved separately on DSMSs and on data stream mining algorithms have failed to converge and produce powerful data stream mining systems. Such systems, however, are essential since the traditional pull-based approach of cache mining is no longer applicable, and the push-based computing mode of data streams and their bursty traffic complicate application development. For instance, to write mining applications with quality of service (QoS) levels approaching those of DSMSs, a mining analyst would have to contend with many arduous tasks, such as support for data buffering, complex storage and retrieval methods, scheduling, fault-tolerance, synopsis-management, load shedding, and query optimization. Our Stream Mill Miner (SMM) system solves these problems by providing a data stream mining workbench that combines the ease of specifying high-level mining tasks, as in Weka, with the performance and QoS guarantees of a DSMS. This is accomplished in three main steps. The first is an open and extensible DSMS architecture where KDD queries can be easily expressed as user-defined aggregates (UDAs)—our system combines that with the efficiency of synoptic data structures and mining-aware load shedding and optimizations. The second key component of SMM is its integrated library of fast mining algorithms that are light enough to be effective on data streams. The third advanced feature of SMM is a Mining Model Definition Language (MMDL) that allows users to define the flow of mining tasks, integrated with a simple box&arrow GUI, to shield the mining analyst from the complexities of lower-level queries. SMM is the first DSMS capable of online mining and this paper describes its architecture, design, and performance on mining queries.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130191042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PrefJoin: An efficient preference-aware join operator
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767894
Mohamed E. Khalefa, M. Mokbel, Justin J. Levandoski
Preference queries are essential to a wide spectrum of applications, including multi-criteria decision-making tools and personalized databases. Unfortunately, most evaluation techniques for preference queries assume that the set of preferred attributes is stored in a single relation, overlooking a wide class of queries that involve preference computations over multiple relations. This paper presents PrefJoin, an efficient preference-aware join query operator designed specifically for preference queries over multiple relations. PrefJoin consists of four main phases: Local Pruning, which filters out from each input relation those tuples that are guaranteed not to be in the final preference set; Data Preparation, which associates metadata with each surviving tuple to optimize the execution of the following phases; Joining, which produces the subset of the join result relevant to the given preference function; and Refining, which refines these tuples into the final answer. An interesting characteristic of PrefJoin is that it tightly integrates preference computation with the join, so tuples that are guaranteed not to be answers are pruned early, saving significant unnecessary computation. PrefJoin supports a variety of preference functions, including skyline, multi-objective and k-dominance preference queries. We show the correctness of PrefJoin. An experimental evaluation based on a real system implementation inside PostgreSQL shows that PrefJoin consistently achieves one to three orders of magnitude performance gain over its competitors in various scenarios.
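For intuition about the Local Pruning phase, the sketch below shows a standard skyline dominance test and a quadratic local filter under a smaller-is-better semantics. This is a simplified assumption of ours, not PrefJoin's actual implementation, which runs inside the join and also covers multi-objective and k-dominance preferences.

import java.util.ArrayList;
import java.util.List;

// Sketch of the kind of dominance-based filtering a local pruning phase can
// apply per input relation before joining (skyline semantics, minimize all).
public final class LocalPruning {
    // true if a dominates b: a is no worse on every attribute, better on one
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    // keep only tuples not dominated by any other tuple of the same relation
    static List<double[]> prune(List<double[]> tuples) {
        List<double[]> survivors = new ArrayList<>();
        outer:
        for (double[] t : tuples) {
            for (double[] u : tuples) {
                if (u != t && dominates(u, t)) continue outer;
            }
            survivors.add(t);
        }
        return survivors;
    }
}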
{"title":"PrefJoin: An efficient preference-aware join operator","authors":"Mohamed E. Khalefa, M. Mokbel, Justin J. Levandoski","doi":"10.1109/ICDE.2011.5767894","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767894","url":null,"abstract":"Preference queries are essential to a wide spectrum of applications including multi-criteria decision-making tools and personalized databases. Unfortunately, most of the evaluation techniques for preference queries assume that the set of preferred attributes are stored in only one relation, waiving on a wide set of queries that include preference computations over multiple relations. This paper presents PrefJoin, an efficient preference-aware join query operator, designed specifically to deal with preference queries over multiple relations. PrefJoin consists of four main phases: Local Pruning, Data Preparation, Joining, and Refining that filter out, from each input relation, those tuples that are guaranteed not to be in the final preference set, associate meta data with each non-filtered tuple that will be used to optimize the execution of the next phases, produce a subset of join result that are relevant for the given preference function, and refine these tuples respectively. An interesting characteristic of PrefJoin is that it tightly integrates preference computation with join hence we can early prune those tuples that are guaranteed not to be an answer, and hence it saves significant unnecessary computations cost. PrefJoin supports a variety of preference function including skyline, multi-objective and k-dominance preference queries. We show the correctness of PrefJoin. Experimental evaluation based on a real system implementation inside PostgreSQL shows that PrefJoin consistently achieves from one to three orders of magnitude performance gain over its competitors in various scenarios.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selectivity estimation for extraction operators over text data
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767931
D. Wang, Long Wei, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan
Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose building document synopses over a text corpus from which the selectivity can be estimated. We first adapt language models from the natural language processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: the stratified Bloom filter synopsis and the roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of selectivity estimation across different classes and variations of synopses. The results show that the top-k stratified Bloom filter synopsis and the roll-up synopsis are the most accurate for dictionary and regular expression selectivity estimation, respectively.
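A minimal sketch of the baseline idea, under our own simplifying assumptions: keep an n-gram document-frequency table (a real top-k synopsis retains only the k most frequent grams), and bound how many documents a dictionary entry can match by the document frequency of its rarest n-gram. The paper's stratified Bloom filter and roll-up synopses refine this considerably.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified n-gram synopsis for dictionary selectivity estimation.
public final class NGramSynopsis {
    private final Map<String, Integer> gramDocFreq = new HashMap<>();
    private final int n;
    private int totalDocs;

    NGramSynopsis(int n) { this.n = n; }

    // record the distinct n-grams of a document (document frequency)
    void addDocument(String doc) {
        totalDocs++;
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= doc.length(); i++) grams.add(doc.substring(i, i + n));
        for (String g : grams) gramDocFreq.merge(g, 1, Integer::sum);
        // a top-k synopsis would now evict all but the k most frequent grams
    }

    // each entry matches at most as many documents as its rarest n-gram
    double estimateDictionarySelectivity(List<String> entries) {
        double matched = 0;
        for (String e : entries) {
            int rarest = Integer.MAX_VALUE;
            for (int i = 0; i + n <= e.length(); i++) {
                rarest = Math.min(rarest, gramDocFreq.getOrDefault(e.substring(i, i + n), 0));
            }
            if (rarest != Integer.MAX_VALUE) matched += rarest;
        }
        return totalDocs == 0 ? 0 : Math.min(1.0, matched / totalDocs);
    }
}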
{"title":"Selectivity estimation for extraction operators over text data","authors":"D. Wang, Long Wei, Yunyao Li, Frederick Reiss, Shivakumar Vaithyanathan","doi":"10.1109/ICDE.2011.5767931","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767931","url":null,"abstract":"Recently, there has been increasing interest in extending relational query processing to efficiently support extraction operators, such as dictionaries and regular expressions, over text data. Many text processing queries are sophisticated in that they involve multiple extraction and join operators, resulting in many possible query plans. However, there has been little research on building the selectivity or cost estimation for these extraction operators, which is crucial for an optimizer to pick a good query plan. In this paper, we define the problem of selectivity estimation for dictionaries and regular expressions, and propose to develop document synopses over a text corpus, from which the selectivity can be estimated. We first adapt the language models in the Natural Language Processing literature to form the top-k n-gram synopsis as the baseline document synopsis. Then we develop two classes of novel document synopses: stratified bloom filter synopsis and roll-up synopsis. We also develop techniques to decompose a complicated regular expression into subparts to achieve more effective and accurate estimation. We conduct experiments over the Enron email corpus using both real-world and synthetic workloads to compare the accuracy of the selectivity estimation over different classes and variations of synopses. The results show that, the top-k stratified bloom filter synopsis and the roll-up synopsis is the most accurate in dictionary and regular expression selectivity estimation respectively.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122612131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Answering approximate string queries on large data sets using external memory
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767856
Alexander Behm, Chen Li, M. Carey
An approximate string query finds, in a collection of strings, those that are similar to a given query string. Answering such queries is important in many applications, such as data cleaning and record linkage, where errors can occur in queries as well as in the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries, and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well, and that our new techniques answer queries efficiently. Further, our solutions significantly outperform the BED-tree, a recent tree-based index.
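For background on how gram-based inverted lists yield candidate matches, the sketch below shows the standard in-memory count-filtering step: strings within edit distance d of a query q must share at least (|q| - n + 1) - n*d grams with it. The disk layout and cost-based list access that the paper actually contributes are not modeled here.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Gram-based candidate generation with count filtering (in-memory sketch).
public final class GramIndex {
    private final int n = 3;                       // gram length
    private final Map<String, List<Integer>> lists = new HashMap<>();
    private final List<String> strings = new ArrayList<>();

    void add(String s) {
        int id = strings.size();
        strings.add(s);
        for (int i = 0; i + n <= s.length(); i++) {
            lists.computeIfAbsent(s.substring(i, i + n), g -> new ArrayList<>()).add(id);
        }
    }

    // candidates within edit distance d of q must share at least
    // (|q| - n + 1) - n * d grams with q (count filter)
    List<String> candidates(String q, int d) {
        Map<Integer, Integer> overlap = new HashMap<>();
        for (int i = 0; i + n <= q.length(); i++) {
            for (int id : lists.getOrDefault(q.substring(i, i + n), List.of())) {
                overlap.merge(id, 1, Integer::sum);
            }
        }
        int threshold = Math.max(1, (q.length() - n + 1) - n * d);
        List<String> result = new ArrayList<>();
        overlap.forEach((id, c) -> { if (c >= threshold) result.add(strings.get(id)); });
        return result;
    }
}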
{"title":"Answering approximate string queries on large data sets using external memory","authors":"Alexander Behm, Chen Li, M. Carey","doi":"10.1109/ICDE.2011.5767856","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767856","url":null,"abstract":"An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115232350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Updating XML schemas and associated documents through Eχup
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767951
Federico Cavalieri, G. Guerrini, M. Mesiti
Most data on the Web is in XML format, and the need often arises to update its structure, which is commonly described by an XML Schema. When a schema is modified, the effects of the modification on existing documents must be addressed. XSUpdate is a language that makes it easy to identify parts of an XML Schema, apply a modification primitive to them and, finally, define an adaptation for the associated documents; Eχup is the corresponding engine for processing schema modification and document adaptation statements. The purpose of this demonstration is to provide an overview of the facilities of the XSUpdate language and of the Eχup system.
{"title":"Updating XML schemas and associated documents through exup","authors":"Federico Cavalieri, G. Guerrini, M. Mesiti","doi":"10.1109/ICDE.2011.5767951","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767951","url":null,"abstract":"Data on the Web mostly are in XML format and the need often arises to update their structure, commonly described by an XML Schema. When a schema is modified the effects of the modification on documents need to be faced. XSUpdate is a language that allows to easily identify parts of an XML Schema, apply a modification primitive on them and finally define an adaptation for associated documents, while Eχup is the corresponding engine for processing schema modification and document adaptation statements. Purpose of this demonstration is to provide an overview of the facilities of the XSUpdate language and of the Eχup system.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132673216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interval-based pruning for top-k processing over compressed lists
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767855
K. Chakrabarti, S. Chaudhuri, Venkatesh Ganti
Optimizing the execution of top-k queries over record-id-ordered, compressed lists is challenging. The threshold family of algorithms cannot be used effectively in such cases. Yet improving the execution of such queries is of great value. For example, top-k keyword search in information retrieval (IR) engines represents an important scenario where such optimization can be directly beneficial. In this paper, we develop novel algorithms that improve the execution of such queries over state-of-the-art techniques. Our main insights are pruning based on fine-granularity bounds and traversing the lists based on judiciously chosen “intervals” rather than individual records. We formally study the optimality characteristics of the proposed algorithms. Our algorithms require minimal changes and can be easily integrated into IR engines. Our experiments on real-life datasets show that our algorithms outperform state-of-the-art techniques by a factor of 3–6 in query execution time.
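A minimal sketch of the interval idea under our own simplified bound model: each interval of record-ids carries an upper-bound score, and an interval whose bound cannot beat the current k-th best score is skipped without ever decompressing its records. How bounds are computed and intervals are chosen is the paper's contribution and is not modeled here.

import java.util.PriorityQueue;

// Interval-at-a-time top-k with bound-based pruning (illustrative sketch).
public final class IntervalPruning {
    public record Interval(int firstRecId, int lastRecId, double maxScore) {}

    interface ScoreFunction { double score(int recId); }

    // keep the k highest exact scores, visiting intervals in record-id order
    static PriorityQueue<Double> topK(Iterable<Interval> intervals, int k,
                                      ScoreFunction exact) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap of top-k
        for (Interval iv : intervals) {
            double kth = heap.size() < k ? Double.NEGATIVE_INFINITY : heap.peek();
            if (iv.maxScore() <= kth) continue;   // fine-granularity bound prunes it
            for (int id = iv.firstRecId(); id <= iv.lastRecId(); id++) {
                double s = exact.score(id);       // decompress and score only here
                if (heap.size() < k) heap.add(s);
                else if (s > heap.peek()) { heap.poll(); heap.add(s); }
            }
        }
        return heap;
    }
}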
{"title":"Interval-based pruning for top-k processing over compressed lists","authors":"K. Chakrabarti, S. Chaudhuri, Venkatesh Ganti","doi":"10.1109/ICDE.2011.5767855","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767855","url":null,"abstract":"Optimizing execution of top-k queries over record-id ordered, compressed lists is challenging. The threshold family of algorithms cannot be effectively used in such cases. Yet, improving execution of such queries is of great value. For example, top-k keyword search in information retrieval (IR) engines represents an important scenario where such optimization can be directly beneficial. In this paper, we develop novel algorithms to improve execution of such queries over state of the art techniques. Our main insights are pruning based on fine-granularity bounds and traversing the lists based on judiciously chosen “intervals” rather than individual records. We formally study the optimality characteristics of the proposed algorithms. Our algorithms require minimal changes and can be easily integrated into IR engines. Our experiments on real-life datasets show that our algorithm outperform the state of the art techniques by a factor of 3–6 in terms of query execution times.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134609844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ontological queries: Rewriting and optimization
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767965
G. Gottlob, G. Orsi, Andreas Pieris
Ontological queries are evaluated against an enterprise ontology rather than directly on a database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent query against the underlying relational database; the focus here is on soundness and completeness. We review previous results and present a new rewriting algorithm for rather general types of ontological constraints (description logics). In particular, we show how a conjunctive query (CQ) against an enterprise ontology can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this process so as to produce a small and cost-effective output UCQ. We review existing optimization methods and propose an effective new method that works for Linear Datalog±, an ontology language that encompasses well-known description logics of the DL-Lite family.
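A one-step example of the rewriting (our illustration, not an example from the paper). Suppose the ontology contains the linear constraint

  Professor(x) → Employee(x)

and the user asks the CQ

  q(x) :- Employee(x), WorksIn(x, y).

Answers derivable through the constraint must also be returned, so the rewriting resolves the Employee atom against the constraint head and adds a second CQ, yielding the UCQ

  q(x) :- Employee(x), WorksIn(x, y)
  q(x) :- Professor(x), WorksIn(x, y)

which mentions only database relations and can be evaluated directly, e.g. as a SQL UNION.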
{"title":"Ontological queries: Rewriting and optimization","authors":"G. Gottlob, G. Orsi, Andreas Pieris","doi":"10.1109/ICDE.2011.5767965","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767965","url":null,"abstract":"Ontological queries are evaluated against an enterprise ontology rather than directly on a database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent query against the underlying relational database. The focus here is on soundness and completeness. We review previous results and present a new rewriting algorithm for rather general types of ontological constraints (description logics). In particular, we show how a conjunctive query (CQ) against an enterprise ontology can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this process so to produce possibly small and cost-effective output UCQ. We review existing optimization methods, and propose an effective new method that works for Linear Datalog±, a description logic that encompasses well-known description logics of the DL-Lite family.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129008371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Program transformations for asynchronous query submission
Pub Date: 2011-04-11 · DOI: 10.1109/ICDE.2011.5767870
Mahendra Chavan, Ravindra Guravannavar, Karthik Ramachandra, Sundararajarao Sudarshan
Synchronous execution of queries or Web service requests forces the calling application to block until the query/request is satisfied. The performance of applications can be significantly improved by asynchronous submission of queries, which allows the application to perform other processing instead of blocking while the query executes, and to issue multiple queries concurrently. Concurrent submission of multiple queries can allow the query execution engine to better utilize multiple processors and disks, and to reorder disk I/O requests to minimize seeks. Concurrent submission also reduces the impact of network round-trip latency and delays at the database when processing multiple queries. However, manually writing applications to exploit asynchronous query submission is tedious. In this paper we address the issue of automatically transforming a program written assuming synchronous query submission into one that exploits asynchronous query submission. Our program transformation method is based on dataflow analysis and is framed as a set of transformation rules. Our rules can handle query executions within loops, unlike some earlier work in this area. We have built a tool that implements our transformation techniques on Java code that uses JDBC calls; our tool can be extended to handle Web service calls. We have carried out a detailed experimental study on several real-life applications rewritten using our transformation techniques. The study shows the effectiveness of the proposed rewrite techniques, both in terms of their applicability and the performance gains achieved.
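To illustrate the effect of such a transformation on JDBC code, here is a hand-written sketch (not the tool's actual output): a loop that originally blocked on one query per iteration is rewritten to submit all queries to a thread pool and block only when each result is consumed. Table and column names are assumptions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.sql.DataSource;

public final class AsyncSubmission {
    // After transformation: queries are submitted up front; the application
    // performs the blocking get() only when it finally needs each result.
    static List<String> lookupNames(DataSource ds, List<Integer> ids) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> pending = new ArrayList<>();
        for (int id : ids) {                          // submission no longer blocks
            pending.add(pool.submit(() -> {
                try (Connection conn = ds.getConnection(); // one connection per task
                     PreparedStatement ps = conn.prepareStatement(
                             "SELECT name FROM users WHERE id = ?")) {
                    ps.setInt(1, id);
                    try (ResultSet rs = ps.executeQuery()) {
                        return rs.next() ? rs.getString(1) : null;
                    }
                }
            }));
        }
        List<String> names = new ArrayList<>();
        for (Future<String> f : pending) names.add(f.get()); // block only on use
        pool.shutdown();
        return names;
    }
}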
{"title":"Program transformations for asynchronous query submission","authors":"Mahendra Chavan, Ravindra Guravannavar, Karthik Ramachandra, Sundararajarao Sudarshan","doi":"10.1109/ICDE.2011.5767870","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767870","url":null,"abstract":"Synchronous execution of queries or Web service requests forces the calling application to block until the query/request is satisfied. The performance of applications can be significantly improved by asynchronous submission of queries, which allows the application to perform other processing instead of blocking while the query is executed, and to concurrently issue multiple queries. Concurrent submission of multiple queries can allow the query execution engine to better utilize multiple processors and disks, and to reorder disk IO requests to minimize seeks. Concurrent submission also reduces the impact of network round-trip latency and delays at the database, when processing multiple queries. However, manually writing applications to exploit asynchronous query submission is tedious. In this paper we address the issue of automatically transforming a program written assuming synchronous query submission, to one that exploits asynchronous query submission. Our program transformation method is based on dataflow analysis and is framed as a set of transformation rules. Our rules can handle query executions within loops, unlike some of the earlier work in this area. We have built a tool that implements our transformation techniques on Java code that uses JDBC calls; our tool can be extended to handle Web service calls. We have carried out a detailed experimental study on several real-life applications rewritten using our transformation techniques. The experimental study shows the effectiveness of the proposed rewrite techniques, both in terms of their applicability and performance gains achieved.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125016615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}