2013 IEEE 29th International Conference on Data Engineering (ICDE): Latest Papers, Page 5

Predicting query execution time: Are optimizer cost models really unusable?

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544899

Wentao Wu, Yun Chi, Shenghuo Zhu, J. Tatemura, Hakan Hacıgümüş, J. Naughton

Predicting query execution time is useful in many database management issues including admission control, query scheduling, progress monitoring, and system sizing. Recently the research community has been exploring the use of statistical machine learning approaches to build predictive models for this task. An implicit assumption behind this work is that the cost models used by query optimizers are insufficient for query execution time prediction. In this paper we challenge this assumption and show that while the simple approach of scaling the optimizer's estimated cost indeed fails, a properly calibrated optimizer cost model is surprisingly effective. However, even a well-tuned optimizer cost model will fail in the presence of errors in cardinality estimates. Accordingly we investigate the novel idea of spending extra resources to refine estimates for the query plan after it has been chosen by the optimizer but before execution. In our experiments we find that a well-calibrated query optimizer model along with cardinality estimation refinement provides a low-overhead way to provide estimates that are always competitive and often much better than the best reported numbers from the machine learning approaches.

Citations: 166
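The idea of a calibrated cost model in the abstract above can be illustrated with a minimal sketch. It assumes a linear model in which predicted time is the sum of estimated operation counts multiplied by per-operation times measured on the target machine. The unit names, the `scan_counts` formula, and every number are hypothetical and are not taken from the paper.

```python
# Hypothetical cost units, loosely modelled on PostgreSQL-style planner parameters.
COST_UNITS = ("seq_page", "random_page", "cpu_tuple", "cpu_index_tuple", "cpu_operator")


def predict_seconds(op_counts, seconds_per_op):
    """Linear cost model: sum of (estimated operation count * calibrated unit time)."""
    return sum(op_counts[u] * seconds_per_op[u] for u in COST_UNITS)


def scan_counts(pages, rows, predicates_per_row):
    """Operation counts for a sequential scan with a filter (illustrative formula)."""
    return {
        "seq_page": pages,
        "random_page": 0,
        "cpu_tuple": rows,
        "cpu_index_tuple": 0,
        "cpu_operator": rows * predicates_per_row,
    }


# Assumed calibration output: measured seconds per primitive operation on one machine.
calibrated = {
    "seq_page": 1e-4,
    "random_page": 1e-3,
    "cpu_tuple": 1e-6,
    "cpu_index_tuple": 5e-7,
    "cpu_operator": 2e-7,
}

# The same plan node costed with the optimizer's row estimate and with a refined one.
before = predict_seconds(scan_counts(pages=10_000, rows=100_000, predicates_per_row=2), calibrated)
after = predict_seconds(scan_counts(pages=10_000, rows=1_000_000, predicates_per_row=2), calibrated)
print(f"optimizer estimate: {before:.2f}s, after refinement: {after:.2f}s")
```

The sketch shows why both ingredients matter: the unit times convert abstract cost into seconds, while the row counts feeding the model determine how far the prediction moves when a cardinality estimate is corrected.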

Similarity query processing for probabilistic sets

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544885

Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou

Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.

Citations: 8
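Possible world semantics, as mentioned in the abstract above, can be made concrete with a brute-force sketch. It assumes a simple model in which each element exists independently with a given probability, and it uses expected Jaccard similarity as one natural similarity over the cross product of worlds. Both choices are assumptions for illustration; the paper's model and its two similarity definitions may differ, and its algorithms avoid this exponential enumeration.

```python
from itertools import product


def possible_worlds(pset):
    """Yield (world, probability) pairs for a dict {element: existence probability}."""
    items = list(pset.items())
    for mask in product([False, True], repeat=len(items)):
        world = frozenset(e for (e, _), keep in zip(items, mask) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
        yield world, prob


def jaccard(a, b):
    """Jaccard similarity of two deterministic sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def expected_jaccard(s, t):
    """Average Jaccard similarity over the cross product of possible worlds."""
    return sum(
        pa * pb * jaccard(wa, wb)
        for wa, pa in possible_worlds(s)
        for wb, pb in possible_worlds(t)
    )


s = {"a": 1.0}
t = {"a": 0.5}
print(expected_jaccard(s, t))
```

Enumeration visits 2^(|s|+|t|) world pairs, which is exactly the blow-up that motivates dynamic programming, pruning, and sampling for sets with thousands of elements.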

LSII: An indexing structure for exact real-time search on microblogs

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544849

Lingkun Wu, Wenqing Lin, Xiaokui Xiao, Yabo Xu

Indexing microblogs for real-time search is challenging given the efficiency issue caused by the tremendous speed at which new microblogs are created by users. Existing approaches address this efficiency issue at the cost of query accuracy, as they either (i) exclude a significant portion of microblogs from the index to reduce update cost or (ii) rank microblogs mostly by their timestamps (without sufficient consideration of their relevance to the queries) to enable append-only index insertion. As a consequence, the search results returned by the existing approaches do not satisfy the users who demand timely and high-quality search results. To remedy this deficiency, we propose the Log-Structured Inverted Indices (LSII), a structure for exact real-time search on microblogs. The core of LSII is a sequence of inverted indices with exponentially increasing sizes, such that new microblogs are (i) first inserted into the smallest index and (ii) later moved into the larger indices in a batch manner. The batch insertion mechanism leads to a small amortized update cost for each new microblog, without significantly degrading query performance. We present a comprehensive study on LSII, exploring various design options to strike a good balance between query and update performance. In addition, we propose extensions of LSII to support personalized search and to exploit multi-threading for performance improvement. Extensive experiments on real data demonstrate the efficiency of LSII.

Citations: 55
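The sequence of exponentially growing inverted indices described in the abstract above can be sketched roughly as follows. This is a simplified illustration, not the authors' data structure: the class name, the doubling capacity rule, and the unranked `search` are assumptions, and the real LSII additionally ranks results by relevance and freshness.

```python
from collections import defaultdict


class LogStructuredIndex:
    """Toy log-structured inverted index: level i holds fewer than base * 2**i documents."""

    def __init__(self, base_capacity=2):
        self.base_capacity = base_capacity
        self.levels = [self._empty_level()]

    @staticmethod
    def _empty_level():
        return {"docs": 0, "postings": defaultdict(list)}

    def _capacity(self, i):
        return self.base_capacity * 2 ** i

    def insert(self, doc_id, terms):
        """New documents always go into the smallest index (level 0)."""
        level0 = self.levels[0]
        for term in set(terms):
            level0["postings"][term].append(doc_id)
        level0["docs"] += 1
        self._cascade()

    def _cascade(self):
        """Move full levels into the next larger level in one batch."""
        i = 0
        while self.levels[i]["docs"] >= self._capacity(i):
            if i + 1 == len(self.levels):
                self.levels.append(self._empty_level())
            src, dst = self.levels[i], self.levels[i + 1]
            for term, ids in src["postings"].items():
                dst["postings"][term].extend(ids)
            dst["docs"] += src["docs"]
            self.levels[i] = self._empty_level()
            i += 1

    def search(self, term):
        """Exact lookup: every level is consulted, so no document is skipped."""
        hits = []
        for level in self.levels:
            hits.extend(level["postings"].get(term, []))
        return sorted(hits)


index = LogStructuredIndex(base_capacity=2)
for doc_id in range(5):
    index.insert(doc_id, ["x", "y"] if doc_id == 3 else ["x"])
print(index.search("x"), [level["docs"] for level in index.levels])
```

Because each document is copied only when a level fills, it takes part in a logarithmic number of merges, which is where the small amortized update cost comes from.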

SELECT triggers for data auditing

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544904

Daniela Fabbri, Ravishankar Ramamurthy, R. Kaushik

Auditing is a key part of the security infrastructure in a database system. While commercial database systems provide mechanisms such as triggers that can be used to track and log any changes made to “sensitive” data using UPDATE queries, they are not useful for tracking accesses to sensitive data using complex SQL queries, which is important for many applications given recent laws such as HIPAA. In this paper, we propose the notion of SELECT triggers that extends triggers to work for SELECT queries in order to facilitate data auditing. We discuss the challenges in integrating SELECT triggers in a database system including specification, semantics as well as efficient implementation techniques. We have prototyped our framework in a commercial database system and present an experimental evaluation of our framework using the TPC-H benchmark.

Citations: 17

EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544856

Xiaofei Zhang, Lei Chen, Yongxin Tong, Min Wang

To benefit from the Cloud platform's unlimited resources, managing and evaluating huge volumes of RDF data in a scalable manner has attracted intensive research efforts recently. Progress has been made on evaluating SPARQL queries with either high-level declarative programming languages, like Pig [1], or a sequence of sophisticatedly designed MapReduce jobs, both of which tend to answer the query with multiple join operations. However, due to the simplicity of Cloud storage and the coarse organization of RDF data in existing solutions, multiple join operations easily bring significant I/O and network traffic, which can severely degrade the system performance. In this work, we first propose EAGRE, an Entity-Aware Graph compREssion technique to form a new representation of RDF data on Cloud platforms, based on which we propose an I/O efficient strategy to evaluate SPARQL queries as quickly as possible, especially queries with specified solution sequence modifiers, e.g., PROJECTION, ORDER BY, etc. We implement a prototype system and conduct extensive experiments over both real and synthetic datasets on an in-house cluster. The experimental results show that our solution can achieve over an order of magnitude of time saving for the SPARQL query evaluation compared to the state-of-the-art MapReduce-based solutions.

Citations: 90
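The abstract above does not spell out how the entity-aware representation is built. One plausible reading, sketched below purely as an assumption, is to gather all triples sharing a subject into an entity and then group entities that use the same set of predicates into a class, so that a query touching certain predicates only needs to read the matching classes. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def build_entity_classes(triples):
    """Group (subject, predicate, object) triples into entities, then into classes.

    Returns (entities, classes) where entities maps a subject to its
    (predicate, object) pairs and classes maps a predicate set to its subjects.
    """
    entities = defaultdict(list)
    for subject, predicate, obj in triples:
        entities[subject].append((predicate, obj))

    classes = defaultdict(list)
    for subject, pairs in entities.items():
        signature = frozenset(predicate for predicate, _ in pairs)
        classes[signature].append(subject)
    return dict(entities), dict(classes)


triples = [
    ("alice", "name", "Alice"),
    ("alice", "age", "30"),
    ("bob", "name", "Bob"),
    ("bob", "age", "25"),
    ("paper1", "title", "EAGRE"),
]
entities, classes = build_entity_classes(triples)
for signature, members in classes.items():
    print(sorted(signature), sorted(members))
```

Under this reading, a query that filters on `age` could skip the class containing `paper1` entirely, which hints at how an entity-level layout might cut I/O compared with joining raw triples.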

Presenting diverse location views with real-time near-duplicate photo elimination

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544851

Jiajun Liu, Zi Huang, Hong Cheng, Yueguo Chen, Heng Tao Shen, Yanchun Zhang

Supported by the technical advances and the commercial success of GPS-enabled mobile devices, geo-tagged photos have drawn considerable attention in the research community. The explosive growth of geo-tagged photos enables many large-scale applications, such as location-based photo browsing, landmark recognition, etc. Meanwhile, as the number of geo-tagged photos continues to climb, new challenges are brought to various applications. The existence of massive near-duplicate geo-tagged photos jeopardizes the effective presentation for the above applications. A new dimension in the search and presentation of geo-tagged photos is urgently needed. In this paper, we devise a location visualization framework to efficiently retrieve and present diverse views captured within a local proximity. Novel photos, in terms of capture locations and visual content, are identified and returned in response to a query location for diverse visualization. For real-time response and good scalability, a new Hybrid Index structure which integrates R-tree and Geographic Grid is proposed to quickly identify the Maximal Near-duplicate Photo Groups (MNPG) in the query proximity. The most novel photos from different groups are then returned to generate diverse views on the location. Extensive experiments on synthetic and real-life photo datasets prove the novelty and efficiency of our methods.

Citations: 18

Coupled clustering ensemble: Incorporating coupling relationships both between base clusterings and objects

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544840

Can Wang, Zhong She, Longbing Cao

Clustering ensemble is a powerful approach for improving the accuracy and stability of individual (base) clustering algorithms. Most of the existing clustering ensemble methods obtain the final solutions by assuming that base clusterings perform independently of one another and all objects are independent too. However, in real-world data sources, objects are more or less associated in terms of certain coupling relationships. Base clusterings trained on the source data are complementary to one another since each of them may only capture a specific part rather than the full picture of the data. In this paper, we discuss the problem of explicating the dependency between base clusterings and between objects in clustering ensembles, and propose a framework for coupled clustering ensembles (CCE). CCE not only considers but also integrates the coupling relationships between base clusterings and between objects. Specifically, we involve both the intra-coupling within one base clustering (i.e., cluster label frequency distribution) and the inter-coupling between different base clusterings (i.e., cluster label co-occurrence dependency). Furthermore, we engage both the intra-coupling between two objects in terms of the base clustering aggregation and the inter-coupling among other objects in terms of neighborhood relationship. This is the first work which explicitly addresses the dependency between base clusterings and between objects, verified by the application of such couplings in three types of consensus functions: clustering-based, object-based and cluster-based. Substantial experiments on synthetic and UCI data sets demonstrate that the CCE framework can effectively capture the interactions embedded in base clusterings and objects with higher clustering accuracy and stability compared to several state-of-the-art techniques, which is also supported by statistical analysis.

Citations: 41
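The two coupling signals between base clusterings named in the abstract above, label frequency distribution and label co-occurrence, can be computed from raw label vectors as in the sketch below. The exact coupling measures used by CCE are not given in the abstract, so the conditional-frequency definition here is an assumption chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def label_frequencies(labels):
    """Intra-coupling signal: relative frequency of each cluster label in one clustering."""
    n = len(labels)
    return {label: count / n for label, count in Counter(labels).items()}


def label_cooccurrence(labels_a, labels_b):
    """Inter-coupling signal: fraction of objects labelled a in clustering A
    that are labelled b in clustering B, for every observed (a, b) pair."""
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    return {(a, b): count / a_counts[a] for (a, b), count in pair_counts.items()}


# Two base clusterings of the same four objects (labels are arbitrary cluster ids).
clustering_a = [0, 0, 1, 1]
clustering_b = [0, 1, 1, 1]
print(label_frequencies(clustering_a))
print(label_cooccurrence(clustering_a, clustering_b))
```

Here cluster 1 of the first clustering maps entirely onto cluster 1 of the second, while cluster 0 is split evenly, which is the kind of dependency an independence assumption would ignore.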

Learning to rank from distant supervision: Exploiting noisy redundancy for relational entity search

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544878

Mianwei Zhou, Hongning Wang, K. Chang

In this paper, we study the task of relational entity search, which aims at automatically learning an entity ranking function for a desired relation. To rank entities, we exploit the abundant redundancy in their snippets; however, such redundancy is noisy as not all the snippets represent information relevant to the desired relation. To explore useful information from such noisy redundancy, we abstract the task as a distantly supervised ranking problem: based on coarse entity-level annotations, we derive a relation-specific ranking function for the purpose of online searching. As the key challenge, without detailed snippet-level annotations, we have to learn an entity ranking function that can effectively filter noise; furthermore, the ranking function should also be online executable. We develop Pattern-based Filter Network (PFNet), a novel probabilistic graphical model, as our solution. To balance the accuracy and efficiency requirements, PFNet selects a limited number of indicative patterns to filter noisy snippets, and inverted indexes are utilized to retrieve required features. Experiments on the large-scale ClueWeb09 data set for six different relations confirm the effectiveness of the proposed PFNet model, which outperforms five state-of-the-art relational entity ranking methods.

Citations: 3

SociaLite: Datalog extensions for efficient social network analysis

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544832

Jiwon Seo, Stephen D. Guo, M. Lam

With the rise of social networks, large-scale graph analysis has become increasingly important. Because SQL lacks the expressiveness and performance needed for graph algorithms, lower-level, general-purpose languages are often used instead. For greater ease of use and efficiency, we propose SociaLite, a high-level graph query language based on Datalog. As a logic programming language, Datalog allows many graph algorithms to be expressed succinctly. However, its performance has not been competitive with that of low-level languages. With SociaLite, users can provide high-level hints on the data layout and evaluation order; they can also define recursive aggregate functions which, as long as they are meet operations, can be evaluated incrementally and efficiently. We evaluated SociaLite by running eight graph algorithms (shortest paths, PageRank, hubs and authorities, mutual neighbors, connected components, triangles, clustering coefficients, and betweenness centrality) on two real-life social graphs, LiveJournal and Last.fm. The optimizations proposed in this paper speed up almost all the algorithms by 3 to 22 times. SociaLite even outperforms typical Java implementations by an average of 50% for the graph algorithms tested. When compared to highly optimized Java implementations, SociaLite programs are an order of magnitude more succinct and easier to write. Its performance is competitive, giving up only 16% for the largest benchmark. Most importantly, being a query language, SociaLite enables many more users who are not proficient in software engineering to make social network queries easily and efficiently.
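
The idea of a recursive aggregate that is a meet operation can be illustrated with shortest paths, where the aggregate is `min`. The Python sketch below is not SociaLite code and is not taken from the paper; it is a minimal illustration, assuming non-negative edge weights, of how such an aggregate can be evaluated incrementally: each round only re-examines the nodes whose distance improved in the previous round. The function name `shortest_paths` and the edge-tuple layout are hypothetical.

```python
from collections import defaultdict


def shortest_paths(edges, source):
    """Incrementally evaluate dist(t) = min over paths from source to t.

    edges: iterable of (src, dst, weight) tuples (hypothetical layout).
    Assumes non-negative weights, so the loop is guaranteed to terminate.
    """
    out = defaultdict(list)
    for s, t, w in edges:
        out[s].append((t, w))

    dist = {source: 0}
    delta = {source}  # nodes whose distance improved in the last round
    while delta:
        next_delta = set()
        for s in delta:
            for t, w in out[s]:
                cand = dist[s] + w
                # min is a meet operation: keeping only the smaller value
                # never invalidates results derived in earlier rounds.
                if cand < dist.get(t, float("inf")):
                    dist[t] = cand
                    next_delta.add(t)
        delta = next_delta
    return dist


if __name__ == "__main__":
    demo_edges = [(1, 2, 4), (1, 3, 1), (3, 2, 2), (2, 4, 5)]
    print(shortest_paths(demo_edges, 1))
```

Because `min` is idempotent, commutative, and associative, the order in which improved nodes are processed does not change the final result, which is what makes the incremental evaluation safe.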

Citations: 112

Ontology-based subgraph querying

2013 IEEE 29th International Conference on Data Engineering (ICDE)

Pub Date : 2013-04-08 DOI: 10.1109/ICDE.2013.6544867

Yinghui Wu, Shengqi Yang, Xifeng Yan

Subgraph querying has been applied in a variety of emerging applications. Traditional subgraph querying based on subgraph isomorphism requires identical label matching, which is often too restrictive to capture matches that are semantically close to the query graph. This paper extends subgraph querying to identify semantically related matches by leveraging ontology information. (1) We introduce ontology-based subgraph querying, which revises subgraph isomorphism by mapping a query to semantically related subgraphs in terms of a given ontology graph. We introduce a metric to measure the similarity of the matches. Based on the metric, we introduce an optimization problem to find the top-K best matches. (2) We provide a filtering-and-verification framework to identify (top-K) matches for ontology-based subgraph queries. The framework efficiently extracts a small subgraph of the data graph from an ontology index, and further computes the matches by accessing only the extracted subgraph. (3) In addition, we show that the ontology index can be efficiently updated upon changes to the data graphs, enabling the framework to cope with dynamic data graphs. (4) We experimentally verify the effectiveness and efficiency of our framework using both synthetic and real-life graphs, in comparison with traditional subgraph querying methods.
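
The relaxation from identical label matching to ontology-based matching can be sketched as a candidate-filtering step. The Python below is only an illustration under stated assumptions, not the framework from the paper: the ontology is assumed to be an adjacency dict over labels, closeness is assumed to be hop distance in that graph, and the score `decay ** hops` is a placeholder for the similarity metric, which the abstract does not specify. The names `related_labels` and `candidate_nodes` are hypothetical.

```python
from collections import deque


def related_labels(ontology, label, max_hops):
    """Breadth-first search over the ontology graph.

    Returns {label: hop_distance} for every label within max_hops.
    ontology: dict mapping a label to its neighbouring labels (assumed layout).
    """
    dist = {label: 0}
    queue = deque([label])
    while queue:
        cur = queue.popleft()
        if dist[cur] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in ontology.get(cur, ()):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def candidate_nodes(data_labels, ontology, query_label, max_hops, decay=0.5):
    """Filter data-graph nodes whose label is ontology-close to query_label.

    data_labels: dict mapping a data-graph node to its label.
    The score decay ** hops is an assumed stand-in for a similarity metric.
    """
    close = related_labels(ontology, query_label, max_hops)
    return {
        node: decay ** close[lab]
        for node, lab in data_labels.items()
        if lab in close
    }


if __name__ == "__main__":
    onto = {
        "museum": ["attraction"],
        "attraction": ["museum", "park"],
        "park": ["attraction"],
    }
    nodes = {"n1": "museum", "n2": "park", "n3": "hotel"}
    print(candidate_nodes(nodes, onto, "museum", max_hops=2))
```

A verification step would then check the structural constraints of the query only against the surviving candidates, which is the point of filtering first.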

Citations: 43