
Latest publications from the 2011 IEEE 27th International Conference on Data Engineering

Memory-constrained aggregate computation over data streams
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767860
K. Naidu, R. Rastogi, Scott Satkin, A. Srinivasan
In this paper, we study the problem of efficiently computing multiple aggregation queries over a data stream. In order to share computation, prior proposals have suggested instantiating certain intermediate aggregates which are then used to generate the final answers for input queries. In this work, we make a number of important contributions aimed at improving the execution and generation of query plans containing intermediate aggregates. These include: (1) a different hashing model, which has low eviction rates and also allows us to accurately estimate the number of evictions, (2) a comprehensive query execution cost model based on these estimates, (3) an efficient greedy heuristic for constructing good low-cost query plans, (4) provably near-optimal and optimal algorithms for allocating the available memory to aggregates in the query plan when the input data distribution is Zipf-like and Uniform, respectively, and (5) a detailed performance study with real-life IP flow data sets, which shows that our multiple-aggregate computation techniques consistently outperform the best-known approach.
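The abstract does not detail the hashing model itself; as a hedged, minimal sketch (the function name, the sum aggregate, and the least-recently-updated eviction policy are illustrative assumptions, not the paper's model), memory-constrained hash aggregation with evictions might look like:

```python
from collections import OrderedDict

def aggregate_stream(stream, capacity):
    """Sum-aggregate (key, value) pairs with at most `capacity` groups in memory.

    When the table is full, the least-recently-updated group is evicted as a
    partial aggregate; evicted partials must be merged by a downstream stage.
    """
    table = OrderedDict()
    evictions = []
    for key, value in stream:
        if key in table:
            table[key] += value
            table.move_to_end(key)        # mark group as recently updated
        else:
            if len(table) >= capacity:    # table full: evict the LRU group
                evictions.append(table.popitem(last=False))
            table[key] = value
    return dict(table), evictions
```

A cost model like the one described would then be driven by the number of evictions this loop produces for a given capacity and key distribution.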
Citations: 14
A unified approach for computing top-k pairs in multidimensional space
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767903
M. A. Cheema, Xuemin Lin, Haixun Wang, Jianmin Wang, W. Zhang
Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are examples of top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, there does not exist a unified approach that can efficiently answer all of them. Moreover, no existing work supports top-k pairs queries based on generic scoring functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries, including those mentioned above. Our approach allows users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of each pair by combining its scores on different attributes. We propose efficient internal- and external-memory algorithms, and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or fewer attributes are involved. Our approach does not require any pre-built indexes, is easy to implement and has low memory requirements. We conduct extensive experiments to demonstrate the efficiency of our proposed approach.
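As a minimal illustration of the local/global scoring interface the abstract describes (a naive O(n²) sketch for clarity, not the paper's optimized algorithms):

```python
import heapq
from itertools import combinations

def top_k_pairs(points, k, local_scores, global_score):
    """Return the k pairs with the smallest combined score.

    `local_scores[i](a, b)` scores the pair on attribute i;
    `global_score` combines the per-attribute scores into a final score.
    """
    def score(a, b):
        return global_score([f(a, b) for f in local_scores])
    return heapq.nsmallest(k, combinations(points, 2), key=lambda p: score(*p))

# k closest pairs under Manhattan distance arise as a special case:
pts = [(0, 0), (1, 1), (5, 5), (1, 0)]
closest = top_k_pairs(
    pts, 2,
    local_scores=[lambda a, b: abs(a[0] - b[0]), lambda a, b: abs(a[1] - b[1])],
    global_score=sum,
)
```

Swapping `global_score` or the per-attribute functions yields furthest-pairs and other variants without changing the driver.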
Citations: 15
On dimensionality reduction of massive graphs for indexing and retrieval
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767834
C. Aggarwal, Haixun Wang
In this paper, we will examine the problem of dimensionality reduction of massive disk-resident data sets. Graph mining has become important in recent years because of its numerous applications in community detection, social networking, and web mining. Many graph data sets are defined on massive node domains in which the number of nodes in the underlying domain is very large. As a result, it is often difficult to store and hold the information necessary in order to retrieve and index the data. Most known methods for dimensionality reduction are effective only for data sets defined on modest domains. Furthermore, while the problem of dimensionality reduction is most relevant to massive data sets, these algorithms are inherently not designed for disk-resident data in terms of the order in which the data is accessed on disk. This is a serious limitation which restricts the applicability of current dimensionality reduction methods. Furthermore, since dimensionality reduction methods are typically designed for database applications such as indexing, it is important to design the underlying data reduction method so that it can be effectively used for such applications. In this paper, we will examine dimensionality reduction of graph data in the difficult case in which the underlying number of nodes is very large and the data set is disk-resident. We will propose an effective sampling algorithm for dimensionality reduction and show how to perform the dimensionality reduction in a limited number of passes on disk. We will also design the technique to be highly interpretable and friendly for indexing applications. We will illustrate the effectiveness and efficiency of the approach on a number of real data sets.
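The sampling algorithm itself is not given in this abstract; the following one-pass sketch merely illustrates the general idea of shrinking a massive node domain (contraction by hashing is an assumption for illustration, not the paper's method):

```python
def contract_graph(edges, m):
    """Reduce a graph on a massive node domain to m super-nodes by hashing.

    Each edge (u, v, w) is mapped to (hash(u) % m, hash(v) % m), and the
    weights of edges that collapse onto the same super-edge are summed.
    The single sequential pass over `edges` suits disk-resident data.
    """
    reduced = {}
    for u, v, w in edges:
        key = (hash(u) % m, hash(v) % m)
        reduced[key] = reduced.get(key, 0) + w
    return reduced
```

The reduced graph fits in memory and can be indexed directly, at the cost of collisions between nodes that share a bucket.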
Citations: 8
HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767867
A. Kemper, Thomas Neumann
The two areas of online transaction processing (OLTP) and online analytical processing (OLAP) present different challenges for database architectures. Currently, customers with high rates of mission-critical transactions have split their data into two separate systems, one database for OLTP and one so-called data warehouse for OLAP. While allowing for decent transaction rates, this separation has many disadvantages, including data freshness issues due to the delay caused by only periodically initiating the extract-transform-load (ETL) data staging, and excessive resource consumption due to maintaining two separate information systems. We present an efficient hybrid system, called HyPer, that can handle both OLTP and OLAP simultaneously by using hardware-assisted replication mechanisms to maintain consistent snapshots of the transactional data. HyPer is a main-memory database system that guarantees the ACID properties of OLTP transactions and executes OLAP query sessions (multiple queries) on the same, arbitrarily current and consistent snapshot. The utilization of the processor-inherent support for virtual memory management (address translation, caching, copy on update) yields both at the same time: unprecedented transaction rates of up to 100,000 per second and very fast OLAP query response times on a single system executing both workloads in parallel. The performance analysis is based on a combined TPC-C and TPC-H benchmark.
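The virtual-memory snapshot idea can be illustrated with a POSIX fork(), whose copy-on-write pages freeze the transactional state for an OLAP session while OLTP updates continue in the parent (a minimal sketch under that assumption, not HyPer's actual implementation; POSIX only):

```python
import os

def olap_snapshot(state, query):
    """Run `query` on a copy-on-write snapshot of `state` in a child process."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child = OLAP session on the frozen pages
        os.close(r)
        os.write(w, repr(query(state)).encode())
        os._exit(0)
    os.close(w)                       # parent = OLTP side keeps running
    return r

# OLTP state keeps changing after the snapshot is taken:
accounts = {"alice": 100, "bob": 50}
r = olap_snapshot(accounts, lambda s: sum(s.values()))
accounts["alice"] += 25               # update applied after the fork
total = int(os.read(r, 64))           # OLAP result reflects the snapshot: 150
os.close(r)
os.wait()
```

The parent's post-fork update touches only its own pages, so the analytical query sees a consistent snapshot without any locking.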
Citations: 651
Top-k keyword search over probabilistic XML data
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767875
Jianxin Li, Chengfei Liu, Rui Zhou, Wei Wang
Despite the proliferation of work on XML keyword queries, supporting keyword queries over probabilistic XML data remains an open problem. Compared with traditional keyword search, answering a keyword query over probabilistic XML data is far more expensive due to the consideration of possible-world semantics. In this paper, we first define the new problem of top-k keyword search over probabilistic XML data, which is to retrieve the k SLCA results with the k highest probabilities of existence. We then propose two efficient algorithms. The first algorithm, PrStack, can find the k SLCA results with the highest probabilities by scanning the relevant keyword nodes only once. To further improve efficiency, we propose a second algorithm, EagerTopK, based on a set of pruning properties that can quickly prune unqualified SLCA candidates. Finally, we implement both algorithms and compare their performance through an analysis of extensive experimental results.
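Independently of how the SLCA candidates and their probabilities are computed, the final selection the abstract describes reduces to keeping the k candidates with the highest existence probabilities, e.g. (names illustrative):

```python
import heapq

def top_k_by_probability(candidates, k):
    """Keep the k SLCA candidates with the highest existence probability.

    `candidates` yields (node, probability) pairs, e.g. produced while
    scanning the relevant keyword-node lists once.
    """
    return heapq.nlargest(k, candidates, key=lambda np: np[1])

top2 = top_k_by_probability([("a", 0.3), ("b", 0.9), ("c", 0.5)], 2)
# → [("b", 0.9), ("c", 0.5)]
```

The pruning properties of an algorithm like EagerTopK would aim to avoid ever materializing candidates that cannot enter this top-k set.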
Citations: 74
Flexible use of cloud resources through profit maximization and price discrimination
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767932
Konstantinos Tsakalozos, H. Kllapi, Evangelia A. Sitaridi, M. Roussopoulos, Dimitris Paparas, A. Delis
Modern frameworks, such as Hadoop, combined with an abundance of computing resources from the cloud, offer a significant opportunity to address long-standing challenges in distributed processing. Infrastructure-as-a-Service clouds reduce the investment cost of renting a large data center, while distributed processing frameworks are capable of efficiently harvesting the rented physical resources. Yet, the performance users get out of these resources varies greatly because the cloud hardware is shared by all users. The value for money that cloud consumers achieve makes resource-sharing policies a key factor in both cloud performance and user satisfaction. In this paper, we employ microeconomics to direct the allotment of cloud resources for consumption in highly scalable master-worker virtual infrastructures. Our approach is developed on two premises: the cloud consumer always has a budget, and cloud physical resources are limited. Using our approach, the cloud administration is able to maximize per-user financial profit. We show that there is an equilibrium point at which our method achieves resource sharing proportional to each user's budget. Ultimately, this approach allows us to answer the question of how many resources a consumer should request from the seemingly endless pool provided by the cloud.
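The equilibrium the authors describe, resource sharing proportional to each user's budget, can be sketched as follows (an illustrative allocation rule only, not the paper's pricing mechanism):

```python
def proportional_share(budgets, capacity):
    """Split `capacity` resource units among users in proportion to budgets."""
    total = sum(budgets.values())
    return {user: capacity * b / total for user, b in budgets.items()}

# A user holding 3/4 of the total budget receives 3/4 of the capacity:
shares = proportional_share({"u1": 30.0, "u2": 10.0}, capacity=8)
```

At such a point no user can gain capacity without outspending the others, which is what makes it an equilibrium of the budget-constrained game.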
Citations: 117
CT-index: Fingerprint-based graph indexing combining cycles and trees
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767909
K. Klein, Nils M. Kriege, Petra Mutzel
Efficient subgraph queries in large databases are a time-critical task in many application areas, such as biology or chemistry, where biological networks or chemical compounds are modeled as graphs. The NP-completeness of the underlying subgraph isomorphism problem renders an exact subgraph test for each database graph infeasible. Therefore, efficient methods are needed that avoid most of these tests while still identifying all graphs containing the query pattern. We propose a new approach based on the filter-verification paradigm, using a new hash-key fingerprint technique with a combination of tree and cycle features for filtering, and a new subgraph isomorphism test for verification. Our approach is able to cope with edge and vertex labels and also allows the use of wildcard patterns for the search. We present an experimental comparison of our approach with state-of-the-art methods using a benchmark set of both real-world and generated graph instances, demonstrating its practicality. Our approach is implemented as part of the Scaffold Hunter software, a tool for the visual analysis of chemical compound databases.
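The filter step of the filter-verification paradigm can be sketched with bitset fingerprints (the hash scheme and feature encoding are assumptions for illustration, not CT-index's actual design):

```python
def fingerprint(features, bits=64):
    """Hash each extracted feature (e.g. a tree or cycle) into one bit."""
    fp = 0
    for f in features:
        fp |= 1 << (hash(f) % bits)
    return fp

def may_contain(graph_fp, query_fp):
    """Filter step: a graph can contain the query only if every query bit is set.

    Survivors still need the exact subgraph-isomorphism test (verification);
    hash collisions can cause false positives but never false negatives.
    """
    return graph_fp & query_fp == query_fp
```

Because fingerprints are fixed-size integers, this test is a single AND-and-compare per database graph, which is what makes the filter phase cheap.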
Citations: 47
T-verifier: Verifying truthfulness of fact statements
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767859
Xian Li, W. Meng, Clement T. Yu
The Web has become the most popular place for people to acquire information. Unfortunately, it is widely recognized that the Web contains a significant amount of untruthful information. As a result, good tools are needed to help Web users determine the truthfulness of certain information. In this paper, we propose a two-step method that aims to determine whether a given statement is truthful and, if it is not, to find the truthful statement most closely related to it. In the first step, we try to find a small number of alternative statements on the same topic as the given statement and make sure that one of these statements is truthful. In the second step, we identify the truthful statement from among the given statement and the alternatives. Both steps rely heavily on analysing various features extracted from the search results returned by a popular search engine for appropriate queries. Our experimental results show that the best variant of the proposed method achieves a precision of about 90%.
Citations: 58
PrefJoin: An efficient preference-aware join operator
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767894
Mohamed E. Khalefa, M. Mokbel, Justin J. Levandoski
Preference queries are essential to a wide spectrum of applications including multi-criteria decision-making tools and personalized databases. Unfortunately, most evaluation techniques for preference queries assume that the set of preferred attributes is stored in only one relation, leaving out a wide set of queries that involve preference computations over multiple relations. This paper presents PrefJoin, an efficient preference-aware join query operator designed specifically to deal with preference queries over multiple relations. PrefJoin consists of four main phases: Local Pruning, which filters out from each input relation those tuples that are guaranteed not to be in the final preference set; Data Preparation, which associates metadata with each surviving tuple to optimize the execution of the later phases; Joining, which produces the subset of the join result relevant to the given preference function; and Refining, which refines these tuples. An interesting characteristic of PrefJoin is that it tightly integrates preference computation with the join, so tuples that are guaranteed not to be answers can be pruned early, saving significant unnecessary computation. PrefJoin supports a variety of preference functions including skyline, multi-objective and k-dominance preference queries. We show the correctness of PrefJoin. Experimental evaluation based on a real system implementation inside PostgreSQL shows that PrefJoin consistently achieves one to three orders of magnitude performance gains over its competitors in various scenarios.
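As a hedged sketch of the Local Pruning idea for a skyline preference over a join (the two-attribute schema and the additive attribute combination are assumptions; PrefJoin's actual phases are more elaborate): if a tuple is dominated within its own relation, every join result it produces is dominated as well, so it can be pruned before the join.

```python
def dominates(a, b):
    """a dominates b: <= in every dimension and < in at least one (minimize)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(tuples):
    """Tuples not dominated by any other tuple."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples if o != t)]

def pref_join_sketch(r, s):
    """Local pruning before joining: locally dominated tuples cannot reach the
    skyline of the join when attributes combine additively, so prune them first."""
    return skyline([(a + c, b + d) for a, b in skyline(r) for c, d in skyline(s)])
```

The early pruning is safe here because if (a, b) is dominated by (a', b') in r, then (a + c, b + d) is dominated by (a' + c, b' + d) for every tuple (c, d) of s.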
{"title":"PrefJoin: An efficient preference-aware join operator","authors":"Mohamed E. Khalefa, M. Mokbel, Justin J. Levandoski","doi":"10.1109/ICDE.2011.5767894","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767894","url":null,"abstract":"Preference queries are essential to a wide spectrum of applications including multi-criteria decision-making tools and personalized databases. Unfortunately, most of the evaluation techniques for preference queries assume that the set of preferred attributes are stored in only one relation, waiving on a wide set of queries that include preference computations over multiple relations. This paper presents PrefJoin, an efficient preference-aware join query operator, designed specifically to deal with preference queries over multiple relations. PrefJoin consists of four main phases: Local Pruning, Data Preparation, Joining, and Refining that filter out, from each input relation, those tuples that are guaranteed not to be in the final preference set, associate meta data with each non-filtered tuple that will be used to optimize the execution of the next phases, produce a subset of join result that are relevant for the given preference function, and refine these tuples respectively. An interesting characteristic of PrefJoin is that it tightly integrates preference computation with join hence we can early prune those tuples that are guaranteed not to be an answer, and hence it saves significant unnecessary computations cost. PrefJoin supports a variety of preference function including skyline, multi-objective and k-dominance preference queries. We show the correctness of PrefJoin. Experimental evaluation based on a real system implementation inside PostgreSQL shows that PrefJoin consistently achieves from one to three orders of magnitude performance gain over its competitors in various scenarios.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
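The Local Pruning and Refining phases in the PrefJoin abstract above can be illustrated with a small skyline-join sketch. This is a minimal, hypothetical rendering of the general idea (dominance-based pruning within each join-key group, then a global skyline refinement over the joined tuples), not the PrefJoin algorithm itself; all function names and the tuple layout are assumptions.

```python
from collections import defaultdict

def dominates(a, b):
    # a dominates b (minimizing): a is no worse in every attribute,
    # and the tuples differ, so a is strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and a != b

def prune_group(group):
    # Local Pruning: within one join-key group, a tuple dominated by another
    # tuple of the same group can never reach the skyline of the join, since
    # the dominating tuple produces join results that dominate its own.
    return [t for t in group if not any(dominates(u, t) for u in group)]

def skyline_join(r, s):
    # r and s are lists of (join_key, preference_attributes) pairs.
    gr, gs = defaultdict(list), defaultdict(list)
    for k, attrs in r:
        gr[k].append(attrs)
    for k, attrs in s:
        gs[k].append(attrs)
    # Joining: combine only the locally pruned groups on matching keys.
    candidates = [a + b
                  for k in gr.keys() & gs.keys()
                  for a in prune_group(gr[k])
                  for b in prune_group(gs[k])]
    # Refining: keep only the globally undominated concatenated tuples.
    return [t for t in candidates if not any(dominates(u, t) for u in candidates)]
```

The point of pruning before joining is the same as in the abstract: a tuple that loses within its own group can be discarded before any join work is spent on it.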
Ontological queries: Rewriting and optimization
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767965
G. Gottlob, G. Orsi, Andreas Pieris
Ontological queries are evaluated against an enterprise ontology rather than directly on a database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of compiling an ontological query into an equivalent query against the underlying relational database; the focus here is on soundness and completeness. We review previous results and present a new rewriting algorithm for rather general types of ontological constraints (description logics). In particular, we show how a conjunctive query (CQ) against an enterprise ontology can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this process so as to produce a small and cost-effective output UCQ. We review existing optimization methods and propose an effective new method that works for Linear Datalog±, a formalism that encompasses well-known description logics of the DL-Lite family.
{"title":"Ontological queries: Rewriting and optimization","authors":"G. Gottlob, G. Orsi, Andreas Pieris","doi":"10.1109/ICDE.2011.5767965","DOIUrl":"https://doi.org/10.1109/ICDE.2011.5767965","url":null,"abstract":"Ontological queries are evaluated against an enterprise ontology rather than directly on a database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent query against the underlying relational database. The focus here is on soundness and completeness. We review previous results and present a new rewriting algorithm for rather general types of ontological constraints (description logics). In particular, we show how a conjunctive query (CQ) against an enterprise ontology can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this process so to produce possibly small and cost-effective output UCQ. We review existing optimization methods, and propose an effective new method that works for Linear Datalog±, a description logic that encompasses well-known description logics of the DL-Lite family.","PeriodicalId":332374,"journal":{"name":"2011 IEEE 27th International Conference on Data Engineering","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129008371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 164
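The CQ-to-UCQ compilation discussed in the abstract above can be made concrete with a toy example. The ontology rule, predicate names, and the dictionary-based database below are all assumptions chosen for illustration; this is a hand-rewritten instance of the idea, not the paper's rewriting algorithm.

```python
# Ontology rule (a linear tuple-generating dependency):
#     professor(X) -> ∃Y teaches(X, Y)
# Input CQ:   q(X) :- teaches(X, Y)   with Y existentially quantified.
# Resolving the atom teaches(X, Y) against the rule head adds the CQ
#     q(X) :- professor(X)
# so the final rewriting is the UCQ q1 ∪ q2, evaluated directly on the data.

def evaluate_ucq(db):
    # db maps a predicate name to a set of tuples (a toy relational instance).
    q1 = {x for (x, _y) in db.get("teaches", set())}    # original CQ
    q2 = {x for (x,) in db.get("professor", set())}     # rewritten CQ
    return q1 | q2                                      # answers of the UCQ

db = {
    "teaches":   {("ann", "db101")},
    "professor": {("bob",)},   # bob teaches *some* course by the ontology rule
}
```

Evaluating `evaluate_ucq(db)` returns both `ann` (present in the data) and `bob` (derivable only through the rule), which is exactly the answer the original CQ alone would miss; producing few, cheap CQs in this union is the optimization problem the paper targets.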