
Latest publications from the 2012 IEEE 28th International Conference on Data Engineering

Optimizing Statistical Information Extraction Programs over Evolving Text
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.60
Fei Chen, Xixuan Feng, C. Ré, Min Wang
Statistical information extraction (IE) programs are increasingly used to build real-world IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical IE approaches consider the text corpora underlying the extraction program to be static. However, many real-world text corpora are dynamic (documents are inserted, modified, and removed). As the corpus evolves, IE programs must be applied repeatedly to consecutive corpus snapshots to keep extracted information up to date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive snapshots may change very little, but, unaware of this, the program must run again from scratch. In this paper, we present CRFlex, a system that efficiently executes such repeated statistical IE by recycling previous IE results to enable incremental update. As a first step, CRFlex focuses on statistical IE programs that use a leading statistical model, Conditional Random Fields (CRFs). We show how to model properties of the CRF inference algorithms for incremental update and how to exploit them to correctly recycle previous inference results. We then show how to efficiently capture and store intermediate results of IE programs for subsequent recycling. We find that there is a tradeoff between the I/O cost spent reading and writing intermediate results and the CPU cost saved by recycling them. We therefore present a cost-based solution that determines the most efficient recycling approach for any given CRF-based IE program and evolving corpus. We conduct extensive experiments with CRF-based IE programs for three IE tasks over a real-world data set to demonstrate the utility of our approach.
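The prefix-recycling idea can be illustrated with a toy Viterbi decoder (CRFs use Viterbi-style inference at extraction time): because the forward pass at position i depends only on tokens 1..i, trellis columns computed for an unchanged document prefix can be reused verbatim. This is a minimal sketch, not CRFlex's actual API; the `emit`/`trans` scorers and the caching interface are assumptions.

```python
def viterbi(tokens, states, emit, trans, cache=None):
    """Decode the best label sequence; reuse trellis columns for any
    token prefix shared with the previously decoded snapshot."""
    trellis = []
    start = 0
    if cache is not None:
        old_tokens, old_trellis = cache
        # Recycle columns while the token prefix is identical.
        while (start < min(len(tokens), len(old_tokens))
               and tokens[start] == old_tokens[start]):
            trellis.append(dict(old_trellis[start]))
            start += 1
    for i in range(start, len(tokens)):
        col = {}
        for s in states:
            score = emit(tokens[i], s)
            if i == 0:
                col[s] = (score, None)
            else:
                best_prev, best_state = max(
                    (trellis[i - 1][p][0] + trans(p, s), p) for p in states)
                col[s] = (score + best_prev, best_state)
        trellis.append(col)
    # Backtrack the best path from the last column.
    path = [max(states, key=lambda s: trellis[-1][s][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(trellis[i][path[-1]][1])
    path.reverse()
    return path, (list(tokens), trellis)
```

Passing the returned cache into the next call skips recomputation for the unchanged prefix; only the edited suffix of the document is re-decoded.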
{"title":"Optimizing Statistical Information Extraction Programs over Evolving Text","authors":"Fei Chen, Xixuan Feng, C. Ré, Min Wang","doi":"10.1109/ICDE.2012.60","DOIUrl":"https://doi.org/10.1109/ICDE.2012.60","url":null,"abstract":"Statistical information extraction (IE) programs are increasingly used to build real-world IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical IE approaches consider the text corpora underlying the extraction program to be static. However, many real-world text corpora are dynamic (documents are inserted, modified, and removed). As the corpus evolves, and IE programs must be applied repeatedly to consecutive corpus snapshots to keep extracted information up to date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive snapshots may change very little, but unaware of this, the program must run again from scratch. In this paper, we present CRFlex, a system that efficiently executes such repeated statistical IE, by recycling previous IE results to enable incremental update. As the first step, CRFlex focuses on statistical IE programs which use a leading statistical model, Conditional Random Fields (CRFs). We show how to model properties of the CRF inference algorithms for incremental update and how to exploit them to correctly recycle previous inference results. Then we show how to efficiently capture and store intermediate results of IE programs for subsequent recycling. We find that there is a tradeoff between the I/O cost spent on reading and writing intermediate results, and CPU cost we can save from recycling those intermediate results. Therefore we present a cost-based solution to determine the most efficient recycling approach for any given CRF-based IE program and an evolving corpus. 
We conduct extensive experiments with CRF-based IE programs for 3 IE tasks over a real-world data set to demonstrate the utility of our approach.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"357 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115469920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Processing of Rank Joins in Highly Distributed Systems
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.108
C. Doulkeridis, Akrivi Vlachou, K. Nørvåg, Y. Kotidis, N. Polyzotis
In this paper, we study efficient processing of rank joins in highly distributed systems, where servers store fragments of relations in an autonomous manner. Existing rank-join algorithms exhibit poor performance in this setting due to excessive communication costs or high latency. We propose a novel distributed rank-join framework that employs data statistics, maintained as histograms, to determine the subset of each relational fragment that must be fetched to generate the top-k join results. At the heart of our framework lies a distributed score bound estimation algorithm that produces, for each relation, score bounds sufficient to guarantee the correctness of the rank-join result set when the histograms are accurate. Furthermore, we propose a generalization of our framework that supports approximate statistics for the case where exact statistical information is not available. An extensive experimental study validates the efficiency of our framework and demonstrates its advantages over existing methods.
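A minimal sketch of histogram-driven bound estimation, heavily simplified relative to the paper: a single score threshold shared by both relations, a sum-style scoring function, and join attributes abstracted into one selectivity factor. All names are hypothetical.

```python
def fetch_threshold(hist_r, hist_s, k, join_selectivity=1.0):
    """hist_r / hist_s map a bucket's lower-bound score to its tuple count.
    Walk thresholds from high to low until the tuples fetched from both
    relations are expected to yield at least k join results."""
    taus = sorted(set(hist_r) | set(hist_s), reverse=True)
    n_r = n_s = 0.0
    for tau in taus:
        n_r += hist_r.get(tau, 0)
        n_s += hist_s.get(tau, 0)
        if n_r * n_s * join_selectivity >= k:
            return tau
    return taus[-1]  # not enough pairs: fetch both fragments entirely
```

Each server then ships only tuples scoring at least the returned threshold, which is where the communication savings come from.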
{"title":"Processing of Rank Joins in Highly Distributed Systems","authors":"C. Doulkeridis, Akrivi Vlachou, K. Nørvåg, Y. Kotidis, N. Polyzotis","doi":"10.1109/ICDE.2012.108","DOIUrl":"https://doi.org/10.1109/ICDE.2012.108","url":null,"abstract":"In this paper, we study efficient processing of rank joins in highly distributed systems, where servers store fragments of relations in an autonomous manner. Existing rank-join algorithms exhibit poor performance in this setting due to excessive communication costs or high latency. We propose a novel distributed rank-join framework that employs data statistics, maintained as histograms, to determine the subset of each relational fragment that needs to be fetched to generate the top-k join results. At the heart of our framework lies a distributed score bound estimation algorithm that produces sufficient score bounds for each relation, that guarantee the correctness of the rank-join result set, when the histograms are accurate. Furthermore, we propose a generalization of our framework that supports approximate statistics, in the case that the exact statistical information is not available. An extensive experimental study validates the efficiency of our framework and demonstrates its advantages over existing methods.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126019518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Viewing the Web as a Distributed Knowledge Base
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.150
S. Abiteboul, Émilien Antoine, Julia Stoyanovich
This paper addresses the challenges faced by everyday Web users, who interact with inherently heterogeneous and distributed information. Managing such data is currently beyond the skills of casual users. We describe ongoing work whose goal is to develop foundations for declarative distributed data management. In this approach, we see the Web as a knowledge base consisting of distributed logical facts and rules. Our objective is to enable automated reasoning over this knowledge base, ultimately improving the quality of service and of data. For this, we use Webdamlog, a Datalog-style language with rule delegation. We outline ongoing efforts on the WebdamExchange platform, which combines Webdamlog evaluation with communication and security protocols.
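To make the "facts and rules" view concrete, here is a toy bottom-up Datalog evaluator (naive fixpoint). Webdamlog's distribution, delegation, and peer variables are omitted, and the fact/rule representation is invented for illustration.

```python
def naive_eval(facts, rules):
    """Facts are (predicate, args) tuples; a rule is
    (head_pred, head_vars, [(body_pred, body_vars), ...]).
    Apply all rules to a fixpoint."""
    db = set(facts)
    changed = True
    while changed:
        changed = False
        for head_pred, head_vars, body in rules:
            new = set()
            for env in _match(body, db, {}):
                fact = (head_pred, tuple(env[v] for v in head_vars))
                if fact not in db:
                    new.add(fact)
            if new:
                db |= new
                changed = True
    return db

def _match(body, db, env):
    """Enumerate variable bindings satisfying the body atoms against db."""
    if not body:
        yield dict(env)
        return
    (pred, vars_), rest = body[0], body[1:]
    for p, args in list(db):
        if p != pred or len(args) != len(vars_):
            continue
        env2 = dict(env)
        if all(env2.setdefault(v, a) == a for v, a in zip(vars_, args)):
            yield from _match(rest, db, env2)
```

With an `edge` relation and the usual two `path` rules, the fixpoint derives the transitive closure, the canonical example of reasoning such a knowledge base enables.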
{"title":"Viewing the Web as a Distributed Knowledge Base","authors":"S. Abiteboul, Émilien Antoine, Julia Stoyanovich","doi":"10.1109/ICDE.2012.150","DOIUrl":"https://doi.org/10.1109/ICDE.2012.150","url":null,"abstract":"This paper addresses the challenges faced by everyday Web users, who interact with inherently heterogeneous and distributed information. Managing such data is currently beyond the skills of casual users. We describe ongoing work that has as its goal the development of foundations for declarative distributed data management. In this approach, we see the Web as a knowledge base consisting of distributed logical facts and rules. Our objective is to enable automated reasoning over this knowledge base, ultimately improving the quality of service and of data. For this, we use Webdamlog, a Datalog-style language with rule delegation. We outline ongoing efforts on the Web dam Exchange platform that combines Webdamlog evaluation with communication and security protocols.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"64 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126087540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Cross Domain Search by Exploiting Wikipedia
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.13
Chen Liu, Sai Wu, Shouxu Jiang, A. Tung
The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich the user experience. This naturally leads to a new cross-modal resource search requirement, in which a query is a resource in one modality and the results are closely related resources in other modalities. With cross-modal search, we can better exploit existing resources. Tags associated with Web 2.0 resources are an intuitive medium for linking resources of different modalities together. However, tagging is by nature an ad hoc activity: tags often contain noise and are affected by the subjective inclination of the tagger. Consequently, linking resources by tags alone is unreliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to raw tags, these concepts are of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.
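A toy illustration of the concept-linking step. The paper's actual scoring over Wikipedia is richer; the per-concept term vocabularies and the Jaccard scoring here are assumptions made for the sketch.

```python
def link_to_concepts(tags, concepts, min_score=0.2):
    """Rank candidate concepts by Jaccard overlap between a resource's
    noisy tag set and each concept's representative terms; drop weak
    matches below min_score."""
    tags = {t.lower() for t in tags}
    scored = []
    for name, terms in concepts.items():
        terms = {t.lower() for t in terms}
        score = len(tags & terms) / len(tags | terms)
        if score >= min_score:
            scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]
```

Once resources in different modalities (an image, a video, a blog post) are mapped to the same concept, the concept acts as the cross-modal link that raw tags provide only unreliably.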
{"title":"Cross Domain Search by Exploiting Wikipedia","authors":"Chen Liu, Sai Wu, Shouxu Jiang, A. Tung","doi":"10.1109/ICDE.2012.13","DOIUrl":"https://doi.org/10.1109/ICDE.2012.13","url":null,"abstract":"The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross-modal resource search requirement, in which a query is a resource in one modal and the results are closely related resources in other modalities. With cross-modal search, we can better exploit existing resources. Tags associated with Web 2.0 resources are intuitive medium to link resources with different modality together. However, tagging is by nature an ad hoc activity. They often contain noises and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"4 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116627303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Accuracy-Aware Uncertain Stream Databases
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.96
Tingjian Ge, Fujun Liu
Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A missing element, however, is how accurate these probability distributions are, which has a profound impact on the accuracy of query results presented to end users. While some previous work studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge we are the first to consider an uncertain stream database in which accuracy is taken into account all the way from the distributions learned from raw data samples to the query results. We perform an initial study of the components of an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain the accuracy of query results. In addition, we propose novel predicates based on hypothesis testing for decision-making over data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.
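The flavor of a hypothesis-testing predicate can be sketched with a one-sided z-test: accept "mean > threshold" only when the samples support it at significance level alpha. This is a generic stand-in, not the paper's actual test statistics.

```python
import math

def mean_exceeds(samples, threshold, alpha=0.05):
    """One-sided test of H0: true mean <= threshold. The predicate holds
    only when H0 is rejected, i.e. when P(Z >= z) falls below alpha."""
    n = len(samples)
    m = sum(samples) / n
    var = sum((x - m) ** 2 for x in samples) / (n - 1)  # sample variance
    z = (m - threshold) / math.sqrt(var / n)
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # normal upper tail
    return p_value < alpha
```

The point of such a predicate is that a slightly-above-threshold sample mean from noisy, limited-accuracy data does not trigger a decision unless the evidence is statistically strong.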
{"title":"Accuracy-Aware Uncertain Stream Databases","authors":"Tingjian Ge, Fujun Liu","doi":"10.1109/ICDE.2012.96","DOIUrl":"https://doi.org/10.1109/ICDE.2012.96","url":null,"abstract":"Previous work has introduced probability distributions as first-class components in uncertain stream database systems. A lacking element is the fact of how accurate these probability distributions are. This indeed has a profound impact on the accuracy of query results presented to end users. While there is some previous work that studies unreliable intermediate query results in the tuple uncertainty model, to the best of our knowledge, we are the first to consider an uncertain stream database in which accuracy is taken into consideration all the way from the learned distributions based on raw data samples to the query results. We perform an initial study of various components in an accuracy-aware uncertain stream database system, including the representation of accuracy information and how to obtain query results' accuracy. In addition, we propose novel predicates based on hypothesis testing for decision-making using data with limited accuracy. We augment our study with a comprehensive set of experimental evaluations.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114073853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Querying Uncertain Spatio-Temporal Data
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.94
Tobias Emrich, H. Kriegel, N. Mamoulis, M. Renz, Andreas Züfle
The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia, and sensor databases. A wide range of work covers spatial uncertainty in the static (snapshot) case, where only one point in time is considered. In contrast, the problem of modeling and querying uncertain spatio-temporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. In this work, we present a framework for efficiently modeling and querying uncertain spatio-temporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First, it allows answering queries in accordance with the possible-worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. Third, all queries on this model can be reduced to simple matrix multiplications. Based on these concepts, we propose efficient solutions for different probabilistic spatio-temporal queries. In an experimental evaluation, we show that our approaches are several orders of magnitude faster than state-of-the-art competitors.
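The "queries reduce to matrix multiplications" point can be sketched with a discrete Markov chain over location cells: the object's location distribution at the next timestamp is the current distribution times the transition matrix. The two-state chain and the region query below are invented for illustration.

```python
def propagate(dist, transition, steps):
    """Advance an uncertain object's location distribution through a
    Markov chain: one matrix-vector product per timestamp."""
    for _ in range(steps):
        dist = [sum(dist[i] * transition[i][j] for i in range(len(dist)))
                for j in range(len(transition[0]))]
    return dist

def prob_in_region(dist, region):
    """P(object lies in any cell of `region`) at the propagated timestamp."""
    return sum(dist[cell] for cell in region)
```

Capturing the transition matrix per object is what preserves the time dependencies between consecutive timestamps that snapshot-style models discard.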
{"title":"Querying Uncertain Spatio-Temporal Data","authors":"Tobias Emrich, H. Kriegel, N. Mamoulis, M. Renz, Andreas Züfle","doi":"10.1109/ICDE.2012.94","DOIUrl":"https://doi.org/10.1109/ICDE.2012.94","url":null,"abstract":"The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia and sensor databases. There exists a wide range of work covering spatial uncertainty in the static (snapshot) case, where only one point of time is considered. In contrast, the problem of modeling and querying uncertain spatio-temporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. In this work, we present a framework for efficiently modeling and querying uncertain spatio-temporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First it allows answering queries in accordance with the possible worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. And third it is possible to reduce all queries on this model to simple matrix multiplications. Based on these concepts we propose efficient solutions for different probabilistic spatio-temporal queries. 
In an experimental evaluation we show that our approaches are several order of magnitudes faster than state-of-the-art competitors.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133992827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71
Effective Data Density Estimation in Ring-Based P2P Networks
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.19
Minqi Zhou, Heng Tao Shen, Xiaofang Zhou, Weining Qian, Aoying Zhou
Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue that has yet to be well addressed. It can benefit many P2P applications, such as load-balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model, distribution-free data density estimation, for dynamic ring-based P2P networks; it achieves high estimation accuracy at low estimation cost regardless of the distribution model of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea of distribution-free estimation is to sample a small subset of peers to estimate the global data distribution over the data domain. We introduce algorithms for computing and sampling the global cumulative distribution function, from which the global data distribution is estimated, together with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.
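The inversion method the paper builds on fits in a few lines: draw u ~ Uniform(0,1) and map it through the inverse CDF, here a discrete empirical CDF. The function names are illustrative only, not the paper's.

```python
import bisect
import random

def make_sampler(values, cdf):
    """Inverse-transform sampler for a discrete distribution:
    values[i] is drawn whenever u lands in (cdf[i-1], cdf[i]].
    cdf must be nondecreasing and end at 1.0."""
    def sample(rng=random):
        u = rng.random()
        return values[bisect.bisect_left(cdf, u)]
    return sample
```

Because any distribution can be sampled through its CDF this way, an estimator that maintains the global CDF needs no assumption about the underlying data's distribution family.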
{"title":"Effective Data Density Estimation in Ring-Based P2P Networks","authors":"Minqi Zhou, Heng Tao Shen, Xiaofang Zhou, Weining Qian, Aoying Zhou","doi":"10.1109/ICDE.2012.19","DOIUrl":"https://doi.org/10.1109/ICDE.2012.19","url":null,"abstract":"Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue and has yet to be well addressed. It can benefit many P2P applications, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model named distribution-free data density estimation for dynamic ring-based P2P networks to achieve high estimation accuracy with low estimation cost regardless of distribution models of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which global data distribution is estimated are introduced with detailed theoretical analysis. 
Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"8 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130356811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Efficiently Monitoring Top-k Pairs over Sliding Windows
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.89
Zhitao Shen, M. A. Cheema, Xuemin Lin, W. Zhang, Haixun Wang
Top-k pairs queries have received significant attention from the research community. k-closest pairs queries, k-furthest pairs queries, and their variants are among the most well-studied special cases of top-k pairs queries. In this paper, we present the first approach to answer a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries, and each query is allowed to use a different scoring function, a different value of k, and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic in the number of objects N it contains, we efficiently answer the top-k pairs query by maintaining a small subset of pairs, called the K-skyband, which is expected to consist of O(K log(N/K)) pairs. For all queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower-bound cost. We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower-bound cost.
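For reference, the query semantics on a single window instance amount to this brute-force baseline; the paper's contribution is precisely avoiding this O(N^2) rescoring on every window slide by maintaining the much smaller K-skyband.

```python
import heapq
from itertools import combinations

def topk_pairs(window, k, score):
    """All O(N^2) pairs in the window, ranked by an arbitrary scoring
    function (higher is better); e.g. a k-closest-pairs query uses
    score = -distance."""
    return heapq.nlargest(k, combinations(window, 2),
                          key=lambda pair: score(*pair))
```

Plugging in a different `score` (negated distance for closest pairs, distance for furthest pairs) recovers the special cases named in the abstract.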
{"title":"Efficiently Monitoring Top-k Pairs over Sliding Windows","authors":"Zhitao Shen, M. A. Cheema, Xuemin Lin, W. Zhang, Haixun Wang","doi":"10.1109/ICDE.2012.89","DOIUrl":"https://doi.org/10.1109/ICDE.2012.89","url":null,"abstract":"Top-k pairs queries have received significant attention by the research community. k-closest pairs queries, k-furthest pairs queries and their variants are among the most well studied special cases of the top-k pairs queries. In this paper, we present the first approach to answer a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic to the number of objects N in the sliding window, we efficiently answer the top-k pairs query by maintaining a small subset of pairs called K-sky band which is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same scoring function, we need to maintain only one K-sky band. We present efficient techniques for the K-sky band maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. 
We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"365 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132875048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
Iterative Graph Feature Mining for Graph Indexing
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.11
Dayu Yuan, P. Mitra, Huiwen Yu, C. Lee Giles
Subgraph search is a popular query scenario on graph databases. Given a query graph q, the subgraph search algorithm returns all database graphs having q as a subgraph. To implement subgraph search efficiently, subgraph features are mined in order to index the graph database. Many subgraph feature mining approaches have been proposed. They are all "mine-at-once" algorithms, in which the whole feature set is mined in one run before building a stable graph index. However, due to changes in the environment (such as an update of the graph database or an increase in available memory), the index needs to be updated to accommodate such changes. Most "mine-at-once" algorithms involve frequent subgraph or subtree mining over the whole graph database. Moreover, constructing and deploying a new index involves expensive disk operations, so re-mining the features and rebuilding the index from scratch is inefficient. We observe that, in most cases, it is sufficient to update a small part of the graph index. Here we propose an "iterative subgraph mining" algorithm that iteratively finds one feature to insert into (or remove from) the index. Since the majority of indexing features and the index structure are unchanged, the algorithm can be invoked frequently. We define an objective function that guides the feature mining. Next, we propose a basic branch-and-bound algorithm to mine the features. Finally, we design an advanced search algorithm that quickly finds a near-optimal subgraph feature and reduces the search space. Experiments show that our feature mining algorithm is 5 times faster than the popular graph indexing algorithm gIndex, and that features mined by our iterative algorithm have a better filtering rate for the subgraph search problem.
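The one-feature-at-a-time selection can be sketched with a toy filtering objective: a query's candidate set is the intersection of the posting lists of the indexed features it contains, and the next feature to add is the one that shrinks total candidates most over a workload. This exhaustive scoring stands in for the paper's branch-and-bound search, and the data layout is an assumption.

```python
def next_feature(postings, indexed, queries, n_graphs):
    """postings: feature -> set of graph ids containing that subgraph.
    queries: each query given as the set of candidate features it contains.
    Returns the non-indexed feature with the largest drop in total
    candidate count, plus that gain."""
    all_graphs = frozenset(range(n_graphs))

    def candidates(indexed_in_query):
        c = all_graphs
        for f in indexed_in_query:
            c = c & postings[f]
        return c

    def cost(idx):
        # Total verification work: candidates summed over the workload.
        return sum(len(candidates(idx & q)) for q in queries)

    base = cost(set(indexed))
    best = min((f for f in postings if f not in indexed),
               key=lambda f: cost(set(indexed) | {f}))
    return best, base - cost(set(indexed) | {best})
```

Calling this repeatedly grows the index one feature per invocation, which is what lets most of the index structure stay untouched between updates.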
Iterative Graph Feature Mining for Graph Indexing
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.11
Dayu Yuan, P. Mitra, Huiwen Yu, C. Lee Giles
Citation count: 20
Earlybird: Real-Time Search at Twitter
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.149
Michael Busch, Krishna Gade, B. Larson, Patrick Lok, Samuel B. Luckenbill, Jimmy J. Lin
The web today is increasingly characterized by social and real-time signals, which we believe represent two frontiers in information retrieval. In this paper, we present Earlybird, the core retrieval engine that powers Twitter's real-time search service. Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval engines, its index structures differ from those built to support traditional web search. We describe these differences and present the rationale behind our design. A key requirement of real-time search is the ability to ingest content rapidly and make it searchable immediately, while concurrently supporting low-latency, high-throughput query evaluation. These demands are met with a single-writer, multiple-reader concurrency model and the targeted use of memory barriers. Earlybird represents a point in the design space of real-time search engines that has worked well for Twitter's needs. By sharing our experiences, we hope to spur additional interest and innovation in this exciting space.
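The single-writer, multiple-reader model mentioned in the abstract can be sketched as an append-only store plus a published watermark: the writer first appends a document and only then advances the watermark, so readers that snapshot the watermark never observe a partially written document. This is an illustrative sketch, not Earlybird's implementation (which is in Java and relies on volatile writes acting as memory barriers; in this CPython sketch the GIL stands in for that ordering guarantee). All names are invented for illustration:

```python
class RealtimeIndexSketch:
    """Single-writer, multiple-reader sketch of an append-only real-time index."""

    def __init__(self):
        self._postings = []  # append-only store, mutated by one writer thread only
        self._max_doc = 0    # published watermark: readers never look past it

    def add_document(self, doc):
        """Writer thread only: write the data first, then publish it."""
        self._postings.append(doc)           # 1) make the document durable in the store
        self._max_doc = len(self._postings)  # 2) publish the new watermark

    def search(self, predicate):
        """Any reader thread: snapshot the watermark once, then scan up to it."""
        limit = self._max_doc
        return [d for d in self._postings[:limit] if predicate(d)]
```

The key design choice this illustrates is that readers need no locks at all: correctness comes purely from the writer's ordering of "data before watermark", which is where the abstract's "targeted use of memory barriers" comes in.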
Citation count: 173
Journal
2012 IEEE 28th International Conference on Data Engineering