
Latest publications: 2011 IEEE 27th International Conference on Data Engineering

Stochastic skyline operator
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767896
Xuemin Lin, Ying Zhang, W. Zhang, M. A. Cheema
In many applications involving multi-criteria optimal decision making, users often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring function; that is, the skyline removes all objects not preferred by any user, no matter how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model has been proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely the stochastic skyline. In light of the expected utility principle, the stochastic skyline is guaranteed to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to conventional skyline or probabilistic skyline computation, we show that the stochastic skyline problem is NP-complete with respect to the dimensionality. Novel algorithms are developed to efficiently compute the stochastic skyline over multi-dimensional uncertain data; they run in polynomial time if the dimensionality is fixed. We also show, by theoretical analysis and experiments, that the size of the stochastic skyline is quite similar to that of the conventional skyline over certain (deterministic) data. Comprehensive experiments demonstrate that our techniques are efficient and scalable with regard to both CPU and I/O costs.
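For context, the conventional-skyline baseline the paper builds on can be sketched in a few lines (a minimal illustration, assuming smaller values are preferred in every dimension; the point data is hypothetical):

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension
    and strictly better in at least one (smaller = preferred)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """The conventional skyline: points not dominated by any other point.
    These are exactly the candidates no monotonic scoring function can rule out."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

hotels = [(1, 2), (2, 1), (2, 2), (3, 3)]  # hypothetical (price, distance) pairs
print(skyline(hotels))
```

Here (2, 2) and (3, 3) are dropped because (1, 2) and (2, 2) respectively dominate them, regardless of how a user weighs price against distance.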
Citations: 42
A new, highly efficient, and easy to implement top-down join enumeration algorithm
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767901
Pit Fender, G. Moerkotte
Finding an optimal execution order of join operations is a crucial task in every cost-based query optimizer. Since there are many possible join trees for a given query, the overhead of the join (tree) enumeration algorithm per valid join tree should be minimal. In the case of a clique-shaped query graph, the best known top-down algorithm has a complexity of Θ(n^2) per join tree, where n is the number of relations. In this paper, we present an algorithm that achieves O(1) complexity per join tree in this case. We show experimentally that this seemingly theoretical result in fact has a high impact on performance in other, non-clique settings as well. This is especially true for cyclic query graphs. Further, we evaluate the performance of our new algorithm and compare it with the best top-down and bottom-up algorithms described in the literature.
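The general shape of top-down join enumeration (not the paper's O(1)-per-tree partitioning itself) can be illustrated with a memoized sketch over a clique query graph, using a deliberately naive cross-product cost model; the relation names and cardinalities below are made up:

```python
from functools import lru_cache
from itertools import combinations

CARD = {'A': 100, 'B': 50, 'C': 200}  # hypothetical base-table cardinalities

def out_card(rels):
    """Toy output cardinality: product of base cardinalities (no selectivities)."""
    n = 1
    for r in rels:
        n *= CARD[r]
    return n

@lru_cache(maxsize=None)
def best_cost(rels: frozenset):
    """Top-down enumeration: split the relation set into two non-empty halves,
    recurse on each, and memoize. Fixing one pivot relation in the left half
    avoids enumerating each symmetric split twice."""
    if len(rels) == 1:
        return 0
    pivot = min(rels)
    rest = sorted(rels - {pivot})
    best = None
    for k in range(len(rest)):          # k < len(rest) keeps the right half non-empty
        for sub in combinations(rest, k):
            left = frozenset({pivot, *sub})
            right = rels - left
            cost = best_cost(left) + best_cost(right) + out_card(rels)
            if best is None or cost < best:
                best = cost
    return best

print(best_cost(frozenset('ABC')))
```

The point the paper addresses is precisely the inner loop: generating the valid splits (the "partitions") cheaply enough that the per-join-tree overhead becomes constant.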
Citations: 24
RAFTing MapReduce: Fast recovery on the RAFT
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767877
Jorge-Arnulfo Quiané-Ruiz, C. Pinkel, Jörg Schad, J. Dittrich
MapReduce is a computing paradigm that has gained a lot of popularity as it allows non-expert users to easily run complex analytical tasks at very large scale. At such scale, task and node failures are no longer an exception but rather a characteristic of large-scale systems. This makes fault-tolerance a critical issue for the efficient operation of any application. MapReduce automatically reschedules failed tasks to available nodes, which in turn recompute such tasks from scratch. However, this policy can significantly decrease the performance of applications. In this paper, we propose a family of Recovery Algorithms for Fast-Tracking (RAFT) MapReduce. As ease-of-use is a major feature of MapReduce, RAFT focuses on simplicity and non-intrusiveness, in order to be implementation-independent. To efficiently recover from task failures, RAFT exploits the fact that MapReduce produces and persists intermediate results at several points in time. RAFT piggy-backs checkpoints on the task progress computation. To deal with multiple node failures, we propose query metadata checkpointing: we keep track of the mapping between input key-value pairs and intermediate data for all reduce tasks. Thereby, RAFT does not need to re-execute completed map tasks entirely. Instead, RAFT only recomputes intermediate data that were processed for local reduce tasks and hence not shipped to another node for processing. We also introduce a scheduling strategy that takes full advantage of these recovery algorithms. We implemented RAFT on top of Hadoop and evaluated it on a 45-node cluster using three common analytical tasks. Overall, our experimental results demonstrate that RAFT outperforms Hadoop runtimes by 23% on average under task and node failures. The results also show that RAFT has negligible runtime overhead.
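The recovery idea can be shown in miniature (a toy illustration, not RAFT's actual Hadoop integration): a map task records its progress in a checkpoint structure, so a restarted attempt skips work whose intermediate output already survived instead of recomputing from scratch; the record format and checkpoint keys are invented:

```python
def run_map_task(records, checkpoint, fail_at=None):
    """Process records, persisting progress after each one. On restart with the
    same checkpoint, already-processed records are skipped, not recomputed."""
    start = checkpoint.get("offset", 0)
    for i in range(start, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        checkpoint.setdefault("output", []).append((records[i], 1))
        checkpoint["offset"] = i + 1  # piggy-backed progress checkpoint
    return checkpoint.get("output", [])

ckpt = {}
try:
    run_map_task(["a", "b", "c", "d"], ckpt, fail_at=2)   # first attempt dies
except RuntimeError:
    pass
result = run_map_task(["a", "b", "c", "d"], ckpt)          # restart resumes at offset 2
print(result)
```

The contrast with stock MapReduce is that a rescheduled task here starts from `offset`, whereas the default policy would replay the whole input split.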
Citations: 83
Monte Carlo query processing of uncertain multidimensional array data
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767887
Tingjian Ge, David J. Grabiner, S. Zdonik
Array database systems are architected for scientific and engineering applications. In these applications, the value of a cell is often imprecise and uncertain. There are at least two reasons why a Monte Carlo query processing algorithm is usually required for such uncertain data. First, a probabilistic graphical model must often be used to model correlation, which requires a Monte Carlo inference algorithm for the operations in our database. Second, the mathematical operators required by science and engineering domains are much more complex than those of SQL. State-of-the-art query processing uses Monte Carlo approximation. We give an example of using Markov Random Fields combined with an array's chunking or tiling mechanism to model correlated data. We then propose solutions for two of the most challenging problems in this framework, namely the expensive array join operation, and the determination and optimization of stopping conditions of Monte Carlo query processing. Finally, we perform an extensive empirical study on a real-world application.
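The stopping-condition idea can be sketched independently of arrays: draw samples until the normal-approximation confidence interval around the running estimate is tight enough. This is a generic sketch, not the paper's optimized condition; the tolerance and the sampled distribution below are arbitrary:

```python
import math
import random

def mc_estimate(sample_once, eps=0.05, z=1.96, min_n=100, max_n=1_000_000):
    """Monte Carlo estimation with a simple stopping rule: stop once the
    half-width of the ~95% normal-approximation confidence interval
    drops below eps."""
    n = 0
    total = 0.0
    total_sq = 0.0
    while n < max_n:
        x = sample_once()
        n += 1
        total += x
        total_sq += x * x
        if n >= min_n:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            if z * math.sqrt(var / n) < eps:
                return mean, n
    return total / n, n

random.seed(42)
mean, n = mc_estimate(lambda: random.gauss(10.0, 2.0))
print(mean, n)
```

The interesting optimization problem the paper tackles is how to choose and check such conditions cheaply when the estimate feeds an expensive operator like an array join, rather than a single scalar aggregate.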
Citations: 7
LTS: Discriminative subgraph mining by learning from search history
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767922
Ning Jin, Wei Wang
Discriminative subgraphs can be used to characterize complex graphs, construct graph classifiers, and generate graph indices. The search space for discriminative subgraphs is usually prohibitively large. Most measures of interestingness for discriminative subgraphs are neither monotonic nor antimonotonic with respect to subgraph frequencies; therefore, branch-and-bound algorithms are unable to mine discriminative subgraphs efficiently. We discover that the search history of discriminative subgraph mining is very useful in computing empirical upper bounds on the discrimination scores of subgraphs. We propose a novel discriminative subgraph mining method, LTS (Learning To Search), which begins with a greedy algorithm that first samples the search space through subgraph probing and then explores the search space in a branch-and-bound fashion, leveraging the search history of these samples. Extensive experiments have been performed to analyze the performance gain from taking search history into account, and to demonstrate that LTS can significantly improve performance compared with state-of-the-art discriminative subgraph mining algorithms.
Citations: 40
Efficient XQuery rewriting using multiple views
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767915
I. Manolescu, Konstantinos Karanasos, V. Vassalos, Spyros Zoupanos
We consider the problem of rewriting XQuery queries using multiple materialized XQuery views. The XQuery dialect we use to express views and queries corresponds to tree patterns (returning data from several nodes, at different granularities, ranging from node identifiers to full XML subtrees) with value joins. We provide correct and complete algorithms for finding minimal rewritings, in which no view is redundant. Our work extends the state of the art by considering more flexible views than the mostly XPath 1.0 dialects previously considered, and more powerful rewritings. We implemented our algorithms and assessed their performance through a set of experiments.
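A drastically simplified, path-only analogue conveys the basic rewriting step (real tree patterns with value joins are far richer than this): if a materialized view's path is a prefix of the query's path, the query can be answered by navigating the remaining steps inside the view's result. The paths and the `$view` variable below are invented notation:

```python
def rewrite_with_view(query_path, view_path):
    """Return a rewriting of query_path over the view's result, or None if the
    view cannot be used (its path is not a prefix of the query's path)."""
    q = query_path.strip("/").split("/")
    v = view_path.strip("/").split("/")
    if q[:len(v)] != v:
        return None
    rest = q[len(v):]
    return "$view" + "".join("/" + step for step in rest)

print(rewrite_with_view("/site/people/person/name", "/site/people"))
print(rewrite_with_view("/site/regions", "/site/people"))
```

Minimality in the paper's sense means the final rewriting combines only views that each contribute something; in this toy setting that reduces to discarding any view whose result another chosen view already covers.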
Citations: 29
Using Markov Chain Monte Carlo to play Trivia
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767941
Daniel Deutch, Ohad Greenshpan, Boris Kostenko, T. Milo
We introduce in this demonstration a system called Trivia Masster that generates a very large database of facts on a variety of topics and uses it for question answering. The facts are collected from human users (the "crowd"); the system motivates users to contribute to the database through a Trivia game, where users gain points based on their contributions. A key challenge here is to provide a suitable data cleaning mechanism that makes it possible to identify which of the facts (answers to Trivia questions) submitted by users are indeed correct/reliable, and consequently how many points to grant users, how to answer questions based on the collected data, and which questions to present to the Trivia players in order to improve data quality. As no existing single data cleaning technique provides a satisfactory solution to this challenge, we propose a novel approach based on a declarative framework for defining recursive and probabilistic data cleaning rules. Our solution employs an algorithm based on Markov Chain Monte Carlo methods.
Citations: 14
Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767868
Thomas Neumann, G. Moerkotte
Accurate cardinality estimates are essential for successful query optimization. This is true not only for relational DBMSs but also for RDF stores. An RDF database consists of a set of triples and, hence, can be seen as a relational database with a single table with three attributes. This makes RDF rather special in that queries typically contain many self-joins. We show that relational DBMSs are not well prepared to perform cardinality estimation in this context. Further, there are hardly any special cardinality estimation methods for RDF databases. To overcome this lack of appropriate cardinality estimation methods, we introduce characteristic sets, together with new cardinality estimation methods based upon them. We then show experimentally that the new methods are, in the RDF context, highly superior to the estimation methods employed by commercial DBMSs and by the open-source RDF store RDF-3X.
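The core construction is compact enough to sketch directly: group triples by subject, record the exact set of predicates each subject uses (its characteristic set), and answer a star query's subject count by summing over all characteristic sets that cover the query's predicates. This follows the idea as described in the abstract; the triples below are invented:

```python
from collections import Counter, defaultdict

def characteristic_sets(triples):
    """Map each distinct predicate set ('characteristic set') to the number
    of subjects exhibiting exactly that set."""
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)
    return Counter(frozenset(ps) for ps in preds.values())

def star_cardinality(csets, query_preds):
    """Count of distinct subjects having at least the query's predicates:
    sum the counts of all characteristic sets that are supersets of them."""
    q = frozenset(query_preds)
    return sum(n for cs, n in csets.items() if q <= cs)

triples = [
    ("s1", "type", "Person"), ("s1", "name", "Ada"),
    ("s2", "type", "Person"), ("s2", "name", "Bob"), ("s2", "email", "b@x"),
    ("s3", "name", "Eve"),
]
cs = characteristic_sets(triples)
print(star_cardinality(cs, {"type", "name"}))
```

Because the sets are exact, the distinct-subject count for a star of self-joins comes out right where per-predicate independence assumptions in a generic optimizer would badly misestimate it.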
Citations: 251
Providing support for full relational algebra in probabilistic databases
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767912
Robert Fink, Dan Olteanu, Swaroop Rath
Extensive work has recently been done on the evaluation of positive queries on probabilistic databases. The case of queries with negation has notoriously been left out, since it raises serious additional challenges to efficient query evaluation.
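The base case that negation adds on top of positive queries is easy to state for tuple-independent tables (the hard part the paper targets is correlated events arising from joins): the probability of a tuple surviving R AND NOT S multiplies R's probability by the complement of S's. The schema and probabilities below are invented:

```python
def prob_difference(p_r, p_s):
    """Per-key probability of appearing in R \ S (R AND NOT S), assuming the
    R-event and S-event for each key are independent:
    P(t in R) * (1 - P(t in S))."""
    return {k: pr * (1 - p_s.get(k, 0.0)) for k, pr in p_r.items()}

r = {"t1": 0.8, "t2": 0.6}   # P(tuple present in R)
s = {"t1": 0.5}              # P(tuple present in S)
print(prob_difference(r, s))
```

Once subqueries share tuples, the events are no longer independent and this product rule breaks down, which is exactly why general evaluation with negation is the challenge the paper addresses.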
Citations: 39
Dynamic prioritization of database queries
Pub Date : 2011-04-11 DOI: 10.1109/ICDE.2011.5767836
S. Narayanan, F. Waas
Enterprise database systems handle a variety of diverse query workloads that are of different importance to the business. For example, periodic reporting queries are usually mission critical whereas ad-hoc queries by analysts tend to be less crucial. It is desirable to enable database administrators to express (and modify) the importance of queries at a simple and intuitive level. The mechanism used to enforce these priorities must be robust, adaptive and efficient.
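One mechanism fitting these requirements is a stride scheduler (an assumption chosen for illustration, not necessarily the authors' design): each query advances a virtual time by quantum/weight whenever it runs, so CPU quanta are granted in proportion to weights, and an administrator can change a weight between quanta to re-prioritize a running workload:

```python
import heapq

def run_to_completion(queries, quantum=1):
    """queries: {name: (weight, work_units)}. Returns the order in which
    quanta were granted; a higher weight means a smaller virtual-time stride
    and hence proportionally more quanta."""
    remaining = {name: work for name, (_, work) in queries.items()}
    heap = [(0.0, name) for name in queries]
    heapq.heapify(heap)
    order = []
    while heap:
        vtime, name = heapq.heappop(heap)
        order.append(name)
        remaining[name] -= quantum
        if remaining[name] > 0:
            weight = queries[name][0]
            heapq.heappush(heap, (vtime + quantum / weight, name))
    return order

# a mission-critical report (weight 3) vs. an ad-hoc query (weight 1)
print(run_to_completion({"report": (3, 3), "adhoc": (1, 2)}))
```

Stride-style schedulers satisfy the robustness and adaptivity requirements above because priorities are just weights: they can change at any quantum boundary without restarting queries, and no query starves since every virtual time eventually becomes the minimum.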
Citations: 17