
2012 IEEE 28th International Conference on Data Engineering: Latest Publications

Efficient Threshold Monitoring for Distributed Probabilistic Data
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.34
Mingwang Tang, Feifei Li, J. M. Phillips, Jeffrey Jestes
In distributed data management, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, commonly known as the distributed threshold monitoring problem [4], [16], [19], [29]. This work extends this useful and fundamental study to distributed probabilistic data, which emerge in many applications where uncertainty naturally exists when massive amounts of data are produced at multiple sources in distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more. When dealing with probabilistic data, two thresholds are involved: the score threshold and the probability threshold. Both must be monitored simultaneously; as such, techniques developed for deterministic data are no longer directly applicable. This work presents a comprehensive study of this problem. Our algorithms significantly outperform the baseline method in both communication cost (number of messages and bytes) and running time, as shown by an extensive experimental evaluation on several large real datasets.
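To illustrate the two-threshold semantics, the following sketch (hypothetical names; a centralized Monte Carlo baseline, not the paper's communication-efficient protocol) gathers each site's per-tuple distribution and alarms only when the aggregate score exceeds the score threshold with probability at least the probability threshold:

```python
import random

def violation_probability(site_distributions, score_threshold, trials=10_000):
    """Estimate Pr[total score > score_threshold] by Monte Carlo.

    site_distributions: one list per site, each a list of
    (value, probability) outcomes for that site's current tuple.
    """
    violations = 0
    for _ in range(trials):
        total = 0.0
        for outcomes in site_distributions:
            r, acc = random.random(), 0.0
            for value, prob in outcomes:
                acc += prob
                if r <= acc:          # sample one outcome per site
                    total += value
                    break
        if total > score_threshold:
            violations += 1
    return violations / trials

def should_alarm(site_distributions, score_threshold, prob_threshold):
    # Alarm only when BOTH thresholds are exceeded: the aggregate score
    # threshold, with probability above the probability threshold.
    return violation_probability(site_distributions, score_threshold) >= prob_threshold
```

A real distributed monitor avoids shipping every distribution to a coordinator; this baseline only makes explicit what is being monitored.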
Citations: 24
Effective and Robust Pruning for Top-Down Join Enumeration Algorithms
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.27
Pit Fender, G. Moerkotte, Thomas Neumann, Viktor Leis
Finding the optimal execution order of join operations is a crucial task of today's cost-based query optimizers. There are two approaches to identifying the best plan: bottom-up and top-down join enumeration. Efficient algorithms have been published for both optimization strategies. However, only the top-down approach allows for branch-and-bound pruning. Two pruning techniques can be found in the literature; we add six new ones. Combined, they improve performance roughly by an average factor of 2-5. Even more importantly, our techniques improve the worst case by two orders of magnitude. Additionally, we introduce a new, very efficient, and easy-to-implement top-down join enumeration algorithm. This algorithm, together with our improved pruning techniques, yields performance that is on average a factor of 6-9 higher than that of the original top-down enumeration algorithm with the original pruning methods.
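To make the branch-and-bound idea concrete, here is a minimal sketch (toy cost model and hypothetical names, not the paper's algorithms or its pruning bounds) of top-down enumeration that recursively partitions the relation set and cuts any branch whose partial cost already reaches the best complete plan found so far:

```python
import math
from itertools import combinations

def join_size(card, rels):
    # Toy estimate: result size of joining `rels` = product of cardinalities.
    return math.prod(card[r] for r in rels)

def best_plan_cost(card, rels, budget=float("inf")):
    """Top-down join enumeration with branch-and-bound pruning (sketch).

    Plan cost = sum of the sizes of all intermediate (and final) join
    results. A branch is cut as soon as its own result size already
    reaches the remaining budget."""
    rels = frozenset(rels)
    if len(rels) <= 1:
        return 0.0            # base relations cost nothing to produce
    out = join_size(card, rels)
    if out >= budget:
        return float("inf")   # prune: cannot beat the best plan so far
    best = float("inf")
    members = sorted(rels)
    anchor, rest = members[0], members[1:]  # fix one relation: no mirrored splits
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((anchor,) + extra)
            right = rels - left
            lcost = best_plan_cost(card, left, best - out)
            rcost = best_plan_cost(card, right, best - out - lcost)
            best = min(best, out + lcost + rcost)
    return best
```

The budget threaded through the recursion is what turns plain top-down enumeration into branch-and-bound; the paper's contribution is making such bounds much tighter.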
Citations: 20
Upgrading Uncompetitive Products Economically
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.92
Hua Lu, Christian S. Jensen
The skyline of a multidimensional point set consists of the points that are not dominated by other points. In a scenario where product features are represented by multidimensional points, the skyline points may be viewed as representing competitive products. A product provider may wish to upgrade uncompetitive products to become competitive, while taking the upgrading cost into account. We study the top-k product upgrading problem: given a set P of competitor products, a set T of products that are candidates for upgrade, and an upgrading cost function f that applies to T, return the k products in T that can be upgraded at the lowest cost so as not to be dominated by any product in P. This problem is non-trivial due not only to the large data set sizes, but also to the many possibilities for upgrading a product. We identify and provide solutions for the different options for upgrading an uncompetitive product, and combine them into a single solution. We also propose a spatial join-based solution that assumes P and T are indexed by an R-tree. Given a set of products in the same R-tree node, we derive three lower bounds on their upgrading costs. The join approach employs these bounds to prune upgrade candidates with uncompetitive upgrade costs. Empirical studies with synthetic and real data show that the join approach is efficient and scalable.
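The dominance semantics and a naive upgrade-cost baseline can be sketched as follows (assumptions: lower attribute values are better, attributes are integers, and cost is one unit per unit of improvement; the enumeration is a brute force, not the paper's lower-bound-driven join approach):

```python
from itertools import product as cartesian

def dominates(p, q):
    # Lower is better in every dimension; p dominates q if p is no worse
    # everywhere and strictly better somewhere.
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def min_upgrade_cost(t, competitors):
    """Brute-force baseline: cheapest upgrade of product t (reducing
    attribute values) that makes t non-dominated by any competitor."""
    dims = len(t)
    choices = []
    for i in range(dims):
        # Candidate values per dimension: keep t[i], or just beat some p.
        vals = {t[i]} | {p[i] - 1 for p in competitors if p[i] - 1 < t[i]}
        choices.append(sorted(vals))
    best = None
    for cand in cartesian(*choices):
        if any(dominates(p, cand) for p in competitors):
            continue
        cost = sum(t[i] - cand[i] for i in range(dims))
        if best is None or cost < best:
            best = cost
    return best
```

The top-k variant would rank the candidates in T by this cost; the paper's R-tree lower bounds exist to avoid evaluating it for most candidates.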
Citations: 16
Efficient Dual-Resolution Layer Indexing for Top-k Queries
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.73
Jongwuk Lee, Hyunsouk Cho, Seung-won Hwang
Top-k queries have gained considerable attention as an effective means of narrowing down the overwhelming amount of data. This paper studies the problem of constructing an indexing structure that efficiently supports top-k queries for varying scoring functions and retrieval sizes. Existing work can be categorized into three classes: list-, layer-, and view-based approaches. This paper focuses on the layer-based approach, which pre-materializes tuples into consecutive layers. A layer-based index can return top-k answers efficiently by restricting access to tuples in the first k layers. However, we observe that the number of tuples accessed in each layer can be reduced further. For this purpose, we propose a dual-resolution layer structure. Specifically, we iteratively build coarse-level layers using skylines, and divide each coarse-level layer into fine-level sublayers using convex skylines. The dual-resolution layer is able to leverage not only the dominance relationship between coarse-level layers, named all-dominance, but also a relaxed dominance relationship between fine-level sublayers, named exists-dominance. Our extensive evaluation results demonstrate that the proposed method accesses significantly fewer tuples than state-of-the-art methods.
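The coarse-level layers can be built by iteratively peeling skylines off the data set, as in this sketch (higher attribute values assumed better; the fine-level convex-skyline sublayers and the two dominance relationships are omitted):

```python
def dominates(p, q):
    # Higher is better here: p dominates q if p >= q everywhere, > somewhere.
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def skyline_layers(points):
    """Peel off coarse-level layers: layer i is the skyline of the points
    left after removing layers 0..i-1. A monotone top-k scoring function
    is maximized by some point within the first k layers."""
    remaining = list(points)
    layers = []
    while remaining:
        layer = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers
```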
Citations: 11
Reducing Uncertainty of Low-Sampling-Rate Trajectories
Pub Date : 2012-04-01 DOI: 10.1109/icde.2012.42
Kai Zheng, Yu Zheng, Xing Xie, Xiaofang Zhou
The increasing availability of GPS-embedded mobile devices has given rise to a new spectrum of location-based services, which have accumulated a huge collection of location trajectories. In practice, a large portion of these trajectories have a low sampling rate; for instance, the time interval between consecutive GPS points can be several minutes or even hours. With such a low sampling rate, most details of the object's movement are lost, making the trajectories difficult to process effectively. In this work, we investigate how to reduce the uncertainty in such trajectories. Specifically, given a low-sampling-rate trajectory, we aim to infer its possible routes. The methodology adopted in our work is to take full advantage of the rich information extracted from historical trajectories. We propose a systematic solution, the History-based Route Inference System (HRIS), which comprises a series of novel algorithms that derive travel patterns from historical data and incorporate them into the route inference process. To validate the effectiveness of the system, we apply our solution to the map-matching problem, an important application scenario of this work, and conduct extensive experiments on a real taxi trajectory dataset. The experimental results demonstrate that HRIS achieves higher accuracy than existing map-matching algorithms for low-sampling-rate trajectories.
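A much-simplified version of the underlying idea, inferring a route between two observed road segments from historical transition frequencies, can be sketched as a first-order Markov model whose most probable route is a shortest path under negative-log-probability edge weights (hypothetical names; HRIS itself is considerably richer than this):

```python
import heapq
import math
from collections import defaultdict

def build_transition_model(historical_routes):
    # Count segment-to-segment transitions over historical trajectories,
    # then normalize into conditional probabilities P(next | current).
    counts = defaultdict(lambda: defaultdict(int))
    for route in historical_routes:
        for a, b in zip(route, route[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def most_likely_route(probs, src, dst):
    """Most probable segment sequence from src to dst: Dijkstra with
    edge weight -log P(b | a), so path cost = -log of route probability."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, math.inf):
            continue
        for nxt, p in probs.get(node, {}).items():
            nd = d - math.log(p)
            if nd < dist.get(nxt, math.inf):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```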
Citations: 209
Predicting Approximate Protein-DNA Binding Cores Using Association Rule Mining
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.86
Po-Yuen Wong, Tak-Ming Chan, M. Wong, K. Leung
The study of protein-DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) is an important bioinformatics topic. High-resolution (length < 10) TF-TFBS binding cores are discovered by expensive and time-consuming 3D structure experiments. Recent association rule mining approaches on low-resolution binding sequences (TF length > 490) have shown promise in identifying accurate binding cores without using any 3D structures. However, the current association rule mining method for this problem addresses exact sequences only, and the most recent ad hoc method for approximation does not establish any formal model and is limited by experimentally known patterns. As biological mutations are common, it is desirable to formally extend the exact model into an approximate one. In this paper, we formalize the problem of mining approximate protein-DNA association rules from sequence data and propose a novel, efficient algorithm to predict protein-DNA binding cores. Our two-phase algorithm first constructs two compact intermediate structures called the frequent sequence tree (FS-Tree) and the frequent sequence class tree (FSCTree). Approximate association rules are efficiently generated from these structures, and bioinformatics concepts (position weight matrix and information content) are further employed to prune meaningless rules. Experimental results on real data show the performance and applicability of the proposed algorithm.
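A toy version of approximate rule mining over TF-TFBS pairs might look as follows (hypothetical names and parameters; the FS-Tree/FSCTree structures and biological pruning are not reproduced): a rule (protein k-mer -> DNA k-mer) is supported by a binding pair when both sides occur in it with at most d mismatches.

```python
from collections import Counter

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def within_hamming(a, b, d):
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= d

def approx_occurs(kmer, seq, d):
    k = len(kmer)
    return any(within_hamming(kmer, seq[i:i + k], d)
               for i in range(len(seq) - k + 1))

def mine_approx_rules(binding_pairs, k=3, min_support=2, d=1):
    """Return (tf_kmer, site_kmer) rules whose approximate support
    (pairs where both sides occur with <= d mismatches) reaches
    min_support. Candidates come from exact co-occurrences."""
    candidates = Counter()
    for tf_seq, site_seq in binding_pairs:
        for a in kmers(tf_seq, k):
            for b in kmers(site_seq, k):
                candidates[(a, b)] += 1
    rules = {}
    for a, b in candidates:
        s = sum(1 for tf_seq, site_seq in binding_pairs
                if approx_occurs(a, tf_seq, d) and approx_occurs(b, site_seq, d))
        if s >= min_support:
            rules[(a, b)] = s
    return rules
```

Tolerating mismatches is what lets a rule survive the point mutations the abstract mentions, at the price of a larger candidate space.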
Citations: 16
An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.29
Dong Deng, Guoliang Li, Jianhua Feng
Dictionary-based entity extraction, which locates substrings in a document that match predefined entities (e.g., person names or locations), has recently attracted much attention from the database community. To improve extraction recall, a recent trend is to allow approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations: first, they need to tune many parameters to achieve high performance; second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index the segments. To extract similar entities, we search for segments in the document and extend the matching segments in both the entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.
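The segment-partition idea rests on a pigeonhole argument: if ed(e, s) <= tau, then splitting e into tau+1 segments guarantees at least one segment occurs verbatim in s. A minimal sketch (a plain dictionary of segments stands in for the trie; names are hypothetical) generates candidates from exact segment hits and verifies them with edit distance:

```python
def partition(entity, tau):
    # Split into tau+1 near-even segments, remembering each offset.
    n, parts = len(entity), tau + 1
    base, extra = divmod(n, parts)
    segs, pos = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        segs.append((entity[pos:pos + length], pos))
        pos += length
    return segs

def edit_distance(a, b):
    # Standard dynamic program with a rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def extract(document, entities, tau=1):
    """Return (start, entity) pairs where a document substring is within
    edit distance tau of a dictionary entity."""
    index = {}
    for e in entities:
        for seg, off in partition(e, tau):
            index.setdefault(seg, []).append((e, off))
    hits = set()
    for i in range(len(document)):
        for seg, cands in index.items():
            if document.startswith(seg, i):
                for e, off in cands:
                    # Verify candidate windows around the aligned position.
                    for shift in range(-tau, tau + 1):
                        start = i - off + shift
                        if start < 0:
                            continue
                        for L in range(len(e) - tau, len(e) + tau + 1):
                            sub = document[start:start + L]
                            if len(sub) == L and edit_distance(sub, e) <= tau:
                                hits.add((start, e))
    return sorted(hits)
```

The trie and the extension-based verification in the paper make both the segment lookup and the candidate verification far cheaper than this sketch.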
Citations: 30
Relevance Matters: Capitalizing on Less (Top-k Matching in Publish/Subscribe)
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.38
Mohammad Sadoghi, H. Jacobsen
The efficient processing of large collections of Boolean expressions plays a central role in major data-intensive applications ranging from user-centric processing and personalization to real-time data analysis. Emerging applications such as computational advertising and selective information dissemination demand determining and presenting to an end-user only the most relevant content, which must be both user-consumable and suitable for the limited screen real estate of target devices. To retrieve the most relevant content, we present BE*-Tree, a novel indexing data structure designed for effective hierarchical top-k pattern matching, which as a by-product also reduces the operational cost of processing millions of patterns. To further reduce processing cost, BE*-Tree employs an adaptive, non-rigid space-cutting technique designed to efficiently index Boolean expressions over a high-dimensional continuous space. At the core of BE*-Tree lie two innovative ideas: (1) a bi-directional tree expansion built as a top-down growth (data and space clustering) and a bottom-up growth (space clustering), which together enable indexing only non-empty continuous subspaces, and (2) an overlap-free splitting strategy. Finally, the performance of BE*-Tree is proven through a comprehensive experimental comparison against state-of-the-art index structures for matching Boolean expressions.
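For contrast with the hierarchical index, a naive top-k matcher over conjunctions of interval predicates looks like this (hypothetical subscription schema; BE*-Tree's contribution is precisely avoiding this linear scan by pruning subtrees whose best score cannot reach the current top-k):

```python
import heapq

# Each subscription is a conjunction of attribute predicates plus a
# static relevance score (e.g., an advertiser bid).
SUBSCRIPTIONS = [
    {"id": 1, "score": 0.9, "preds": {"age": (20, 30), "city": ("NY", "NY")}},
    {"id": 2, "score": 0.5, "preds": {"age": (18, 40)}},
    {"id": 3, "score": 0.7, "preds": {"city": ("SF", "SF")}},
]

def matches(sub, event):
    # An interval predicate (lo, hi) holds when lo <= value <= hi.
    return all(attr in event and lo <= event[attr] <= hi
               for attr, (lo, hi) in sub["preds"].items())

def top_k(event, subs, k=2):
    """Return the k most relevant subscriptions matching the event."""
    return heapq.nlargest(k, (s for s in subs if matches(s, event)),
                          key=lambda s: s["score"])
```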
Citations: 26
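The BE*-Tree abstract above concerns matching an event against many stored Boolean expressions and returning only the top-k most relevant. As a point of reference for what the index accelerates, here is a minimal brute-force sketch of that matching semantics (conjunctions of attribute-range predicates, ranked by a relevance weight). All names and the scoring scheme are hypothetical illustrations, not the paper's API; BE*-Tree's contribution is avoiding exactly this linear scan via hierarchical space partitioning.

```python
# Brute-force top-k matching of conjunctive Boolean expressions
# (attribute -> [lo, hi] ranges) against a single event. Purely
# illustrative; a real system would index the expressions instead.
import heapq

def matches(expr, event):
    """expr: {attr: (lo, hi)}; event: {attr: value}.
    True iff every predicate in expr is satisfied by the event."""
    return all(attr in event and lo <= event[attr] <= hi
               for attr, (lo, hi) in expr.items())

def top_k(exprs, event, k):
    """exprs: list of (relevance_score, expr) pairs.
    Returns indices of the k highest-scoring matching expressions."""
    hits = [(score, i) for i, (score, expr) in enumerate(exprs)
            if matches(expr, event)]
    return [i for score, i in heapq.nlargest(k, hits)]

# Hypothetical subscriptions and one incoming event:
subs = [
    (0.9, {"age": (20, 30), "price": (0, 100)}),
    (0.5, {"age": (25, 40)}),
    (0.7, {"price": (50, 200)}),
]
event = {"age": 26, "price": 80}
print(top_k(subs, event, 2))  # -> [0, 2]
```

All three subscriptions match this event, but only the two most relevant (indices 0 and 2) are returned, matching the paper's motivation of presenting only user-consumable content.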
PRAGUE: Towards Blending Practical Visual Subgraph Query Formulation and Query Processing
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.49
Changjiu Jin, S. Bhowmick, Byron Choi, Shuigeng Zhou
In a previous paper, we laid out the vision of a novel graph query processing paradigm where, instead of processing a visual query graph after its construction, it interleaves visual query formulation and processing by exploiting the latency offered by the GUI to filter irrelevant matches and prefetch partial query results [8]. Our first attempt at implementing this vision, called GBLENDER [8], shows significant improvement in system response time (SRT) for subgraph containment queries. However, GBLENDER suffers from two key drawbacks, namely the inability to handle visual subgraph similarity queries and inefficient support for visual query modification, limiting its usage in practical environments. In this paper, we propose a novel algorithm called PRAGUE (Practical visuAl Graph QUery Blender) that addresses these limitations by exploiting a novel data structure called spindle-shaped graphs (SPIG). A SPIG succinctly records various information related to the set of supergraphs of a newly added edge in the visual query fragment. Specifically, PRAGUE realizes a unified visual framework to support SPIG-based processing of modification-efficient subgraph containment and similarity queries. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of PRAGUE.
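The core interleaving idea above (pruning candidate graphs while the user is still drawing the query) can be illustrated with a toy containment filter: an inverted index from edge labels to database graphs, intersected once per newly drawn edge during GUI latency. This is a hedged sketch of the general prefetch-and-filter principle only; it is not the SPIG data structure, and all names are invented for illustration.

```python
# Toy incremental candidate filter for visual subgraph containment
# queries. Each time the user draws one more edge, the candidate set
# shrinks before any expensive final verification. Illustrative only.
from collections import defaultdict

class IncrementalFilter:
    def __init__(self, graphs):
        # graphs: {graph_id: set of edge labels in that graph}
        self.index = defaultdict(set)
        for gid, edges in graphs.items():
            for e in edges:
                self.index[e].add(gid)
        self.candidates = set(graphs)

    def add_query_edge(self, edge):
        # Invoked during the latency between two GUI actions:
        # only graphs containing every query edge so far survive.
        self.candidates &= self.index.get(edge, set())
        return self.candidates

# Hypothetical database of three graphs, each given by its edge labels:
db = {1: {"ab", "bc", "cd"}, 2: {"ab", "bc"}, 3: {"bc", "cd"}}
f = IncrementalFilter(db)
print(f.add_query_edge("ab"))  # -> {1, 2}
print(f.add_query_edge("cd"))  # -> {1}
```

The point of the sketch is timing, not the filter itself: because each intersection runs while the user is still formulating the query, much of the matching work is already done when the Run button is finally pressed.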
{"title":"PRAGUE: Towards Blending Practical Visual Subgraph Query Formulation and Query Processing","authors":"Changjiu Jin, S. Bhowmick, Byron Choi, Shuigeng Zhou","doi":"10.1109/ICDE.2012.49","DOIUrl":"https://doi.org/10.1109/ICDE.2012.49","url":null,"abstract":"In a previous paper, we laid out the vision of a novel graph query processing paradigm where instead of processing a visual query graph after its construction, it interleaves visual query formulation and processing by exploiting the latency offered by the GUI to filter irrelevant matches and prefetch partial query results [8]. Our first attempt at implementing this vision, called GBLENDER [8], shows significant improvement in system response time (SRT) for sub graph containment queries. However, GBLENDER suffers from two key drawbacks, namely inability to handle visual sub graph similarity queries and inefficient support for visual query modification, limiting its usage in practical environment. In this paper, we propose a novel algorithm called PRAGUE (Practical visuAl Graph QUery Blender), that addresses these limitations by exploiting a novel data structure called spindle-shaped graphs (SPIG). A SPIG succinctly records various information related to the set of super graphs of a newly added edge in the visual query fragment. Specifically, PRAGUE realizes a unified visual framework to support SPIG-based processing of modification-efficient sub graph containment and similarity queries. Extensive experiments on real-world and synthetic datasets demonstrate effectiveness of PRAGUE.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121210230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data
Pub Date : 2012-04-01 DOI: 10.1109/ICDE.2012.51
Yongxin Tong, Lei Chen, Bolin Ding
In recent years, many new applications, such as sensor network monitoring and moving object search, underscore the growing importance of uncertain data management and mining. In this paper, we study the problem of discovering threshold-based frequent closed item sets over probabilistic data. Frequent item set mining over probabilistic databases has attracted much attention recently. However, existing solutions may yield an exponential number of results due to the downward closure property over probabilistic data. Moreover, it is hard to directly extend the successful experiences from mining exact data to a probabilistic environment due to the inherent uncertainty of data. Thus, in order to obtain a reasonably small result set, we study discovering frequent closed item sets over probabilistic data. We prove that even a sub-problem of this problem, computing the frequent closed probability of an item set, is #P-hard. Therefore, we develop an efficient mining algorithm based on a depth-first search strategy to obtain all probabilistic frequent closed item sets. To reduce the search space and avoid redundant computation, we further design several probabilistic pruning and bounding techniques. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments.
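To make the "frequent probability" notion in the abstract above concrete: under the common tuple-independent model, the probability that an itemset X is frequent, P(sup(X) >= minsup), can be computed by a standard dynamic program over the transactions that may contain X. This sketch shows only that basic building block with an assumed independence model; the paper's frequent *closed* probability is harder (#P-hard, as the authors prove) and is not reproduced here.

```python
# Dynamic program for the frequent probability of an itemset X.
# probs[t] = probability that transaction t contains X; transactions
# are assumed independent (tuple-independent model). Illustrative only.
def frequent_probability(probs, minsup):
    # dp[s] = probability that exactly s of the transactions processed
    # so far contain X.
    dp = [1.0]
    for p in probs:
        new = [0.0] * (len(dp) + 1)
        for s, q in enumerate(dp):
            new[s] += q * (1 - p)   # X absent from this transaction
            new[s + 1] += q * p     # X present in this transaction
        dp = new
    # P(support >= minsup) = tail mass of the support distribution.
    return sum(dp[minsup:])

# X possibly appears in three transactions with these probabilities:
print(frequent_probability([0.9, 0.5, 0.4], 2))  # approx. 0.65
```

The DP runs in O(n^2) time for n transactions, which is why an enumeration-style miner built on it still needs the pruning and bounding techniques the abstract mentions to stay tractable over many candidate itemsets.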
{"title":"Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data","authors":"Yongxin Tong, Lei Chen, Bolin Ding","doi":"10.1109/ICDE.2012.51","DOIUrl":"https://doi.org/10.1109/ICDE.2012.51","url":null,"abstract":"In recent years, many new applications, such as sensor network monitoring and moving object search, show a growing amount of importance of uncertain data management and mining. In this paper, we study the problem of discovering threshold-based frequent closed item sets over probabilistic data. Frequent item set mining over probabilistic database has attracted much attention recently. However, existing solutions may lead an exponential number of results due to the downward closure property over probabilistic data. Moreover, it is hard to directly extend the successful experiences from mining exact data to a probabilistic environment due to the inherent uncertainty of data. Thus, in order to obtain a reasonable result set with small size, we study discovering frequent closed item sets over probabilistic data. We prove that even a sub-problem of this problem, computing the frequent closed probability of an item set, is #P-Hard. Therefore, we develop an efficient mining algorithm based on depth-first search strategy to obtain all probabilistic frequent closed item sets. To reduce the search space and avoid redundant computation, we further design several probabilistic pruning and bounding techniques. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121079838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 72