
Latest publications: Proceedings 18th International Conference on Data Engineering

Design and evaluation of alternative selection placement strategies in optimizing continuous queries
Pub Date : 2002-08-07 DOI: 10.1109/ICDE.2002.994749
Jianjun Chen, D. DeWitt, J. Naughton
We design and evaluate alternative selection placement strategies for optimizing a very large number of continuous queries in an Internet environment. Two grouping strategies, PushDown and PullUp, in which selections are either pushed below, or pulled above, joins, are proposed and investigated. While our earlier research has demonstrated that incremental group optimization can significantly outperform an ungrouped approach, the results in this paper show that different incremental group optimization strategies can have significantly different performance characteristics. Surprisingly, in our studies, PullUp, in which selections are pulled above joins, is often better and achieves an average 10-fold performance improvement over PushDown (occasionally 100 times faster). Furthermore, a revised PullUp algorithm, termed filtered PullUp, is proposed that further reduces the cost of PullUp by 75% when the union of the selection predicates is selective. Detailed cost models are presented that consider several key parameters, including (1) the characteristics of the queries to be grouped and (2) the characteristics of the data changes. Preliminary experiments using an implementation of both strategies show that our models are fairly accurate in predicting the results obtained from the implementation of these techniques in the Niagara system.
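The contrast between the two grouping strategies can be sketched in a few lines. This is a toy illustration, not the authors' Niagara implementation; the function names (`run_pushdown`, `run_pullup`) and the data layout are invented for the example. PushDown evaluates each query's selection before its own join, while PullUp computes one shared join and applies each query's selection to its output, so the join work is shared across the whole query group:

```python
# Toy sketch (not the paper's implementation) contrasting the two
# grouping strategies on a batch of continuous queries sharing a join.

def join(left, right):
    """Nested-loop equi-join on the 'key' field."""
    return [{**l, **r} for l in left for r in right if l["key"] == r["key"]]

def run_pushdown(left, right, predicates):
    """Each query filters its input first, then performs its own join."""
    return {qid: join([t for t in left if pred(t)], right)
            for qid, pred in predicates.items()}

def run_pullup(left, right, predicates):
    """One shared join; each query's selection is applied to its output."""
    shared = join(left, right)          # computed once for all queries
    return {qid: [t for t in shared if pred(t)]
            for qid, pred in predicates.items()}

left = [{"key": k, "price": 10 * k} for k in range(5)]
right = [{"key": k, "stock": k % 2} for k in range(5)]
preds = {"q1": lambda t: t["price"] > 20, "q2": lambda t: t["price"] < 30}

# Both strategies produce the same answers; only the work sharing differs.
assert run_pushdown(left, right, preds) == run_pullup(left, right, preds)
```

With many queries over the same join, PullUp performs the join once instead of once per query, which is the intuition behind its advantage in the paper's experiments.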
Citations: 114
Efficient evaluation of queries with mining predicates
Pub Date : 2002-06-03 DOI: 10.1109/ICDE.2002.994772
S. Chaudhuri, Vivek R. Narasayya, Sunita Sarawagi
Modern relational database systems are beginning to support ad hoc queries on data mining models. In this paper, we explore novel techniques for optimizing queries that apply mining models to relational data. For such queries, we use the internal structure of the mining model to automatically derive traditional database predicates. We present algorithms for deriving such predicates for some popular discrete mining models: decision trees, naive Bayes, and clustering. Our experiments on Microsoft SQL Server 2000 demonstrate that these derived predicates can significantly reduce the cost of evaluating such queries.
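As a concrete (and invented) illustration of deriving a database predicate from a discrete mining model, the sketch below walks a tiny hard-coded decision tree and emits a SQL-style disjunction of the conditions along every root-to-leaf path that predicts the target class. The tree encoding and function names are assumptions for the example, not the paper's API:

```python
# A node is either ("leaf", class_label) or
# ("split", attribute, threshold, left_subtree, right_subtree),
# where the left branch means attr <= threshold and the right attr > threshold.
TREE = ("split", "age", 30,
        ("leaf", "reject"),
        ("split", "income", 50000,
         ("leaf", "reject"),
         ("leaf", "accept")))

def paths_to(tree, target, conds=()):
    """Collect the condition tuples of every path reaching the target class."""
    if tree[0] == "leaf":
        return [conds] if tree[1] == target else []
    _, attr, thr, left, right = tree
    return (paths_to(left, target, conds + (f"{attr} <= {thr}",)) +
            paths_to(right, target, conds + (f"{attr} > {thr}",)))

def derived_predicate(tree, target):
    """One conjunct per path, OR-ed together, as a SQL-style string."""
    return " OR ".join("(" + " AND ".join(p) + ")"
                       for p in paths_to(tree, target))

print(derived_predicate(TREE, "accept"))
# → (age > 30 AND income > 50000)
```

A predicate like this can be pushed into the relational engine so that only rows that could possibly be classified "accept" ever reach the mining model.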
Citations: 40
Similarity flooding: a versatile graph matching algorithm and its application to schema matching
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994702
S. Melnik, H. Garcia-Molina, E. Rahm
Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, or even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and if necessary adjust the results. As a matter of fact, we evaluate the 'accuracy' of the algorithm by counting the number of needed adjustments. We conducted a user study, in which our accuracy metric was used to estimate the labor savings that the users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.
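The fixpoint computation can be sketched in miniature. The actual algorithm propagates similarities over a propagation graph with coefficients derived from edge labels; this deliberately simplified toy version propagates uniformly across successor pairs and normalizes each round, which is enough to see a structurally matching node pair float to the top. All names and graphs here are invented:

```python
import itertools

def flood(g1, g2, seed, iters=10):
    """g1, g2: adjacency dicts {node: [successor, ...]}.
    seed: initial similarity dict {(a, b): score}."""
    sigma = dict(seed)
    pairs = list(itertools.product(g1, g2))
    for _ in range(iters):
        nxt = {}
        for a, b in pairs:
            # Similarity flows in from pairs of successors.
            inc = sum(sigma.get((x, y), 0.0)
                      for x in g1[a] for y in g2[b])
            nxt[(a, b)] = sigma.get((a, b), 0.0) + inc
        top = max(nxt.values()) or 1.0
        sigma = {p: v / top for p, v in nxt.items()}  # normalize each round
    return sigma

g1 = {"a": ["b"], "b": []}
g2 = {"x": ["y"], "y": []}
seed = {p: 1.0 for p in itertools.product(g1, g2)}
sim = flood(g1, g2, seed)
# ("a","x") is the only pair supported by a matching successor pair,
# so it ends up with the top score.
```

A filter (for example, keep the best pair per node) would then select a subset of this mapping, and a human would verify and adjust the result, as in the paper's accuracy metric.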
Citations: 1641
Evaluating top-k queries over Web-accessible databases
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994751
Nicolas Bruno, L. Gravano, A. Marian
A query to a Web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" k pages for the query. This top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing such top-k queries efficiently is challenging for a number of reasons. One critical reason is that, in many Web applications, the relation attributes might not be available other than through external Web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real Web-accessible data.
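The restaurant example can be made concrete with a toy in-memory scorer (invented names and weights; the paper's algorithms instead minimize expensive probes to external Web sources, which this sketch ignores). Each restaurant gets a weighted combination of proximity, price fit, and rating, and the k highest-scoring ones are returned:

```python
import heapq

def score(r, user_loc, target_price, w=(0.4, 0.3, 0.3)):
    """Weighted combination of proximity, price match, and food rating."""
    proximity = 1.0 / (1.0 + abs(r["loc"] - user_loc))
    price_fit = 1.0 / (1.0 + abs(r["price"] - target_price))
    return w[0] * proximity + w[1] * price_fit + w[2] * r["rating"]

def top_k(restaurants, k, user_loc, target_price):
    return heapq.nlargest(k, restaurants,
                          key=lambda r: score(r, user_loc, target_price))

rs = [{"name": "A", "loc": 1, "price": 20, "rating": 0.9},
      {"name": "B", "loc": 5, "price": 10, "rating": 0.8},
      {"name": "C", "loc": 1, "price": 10, "rating": 0.2}]
best = top_k(rs, 2, user_loc=1, target_price=10)
```

The hard part the paper addresses is exactly what this sketch glosses over: when `loc`, `price`, and `rating` live behind separate autonomous form interfaces, computing every score up front is prohibitively expensive.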
Citations: 559
Sequenced subset operators: definition and implementation
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994699
Joseph Dunn, S. Davey, A. Descour, R. Snodgrass
Difference, intersection, semijoin and anti-semijoin may be considered binary subset operators, in that they all return a subset of their left-hand argument. These operators are useful for implementing SQL's EXCEPT, INTERSECT, NOT IN and NOT EXISTS, as well as distributed queries and referential integrity. Difference-all and intersection-all operate on multisets and track the number of duplicates in both argument relations; they are used to implement SQL's EXCEPT ALL and INTERSECT ALL. Their temporally sequenced analogues, which effectively apply the subset operator at each point in time, are needed to implement these constructs in temporal databases. The equivalent SQL expressions are complex; most necessitate at least a three-way join with nested NOT EXISTS clauses. We consider how to implement these operators directly in a DBMS. These operators are interesting in that they can fragment the left-hand validity periods (sequenced difference-all also fragments the right-hand periods) and thus introduce memory complications found neither in their non-temporal counterparts nor in temporal joins and semijoins. We introduce novel algorithms that implement these operators by ordering the computation so that fragments need not be retained in main memory. We evaluate these algorithms and demonstrate that they are no more expensive than a single conventional join.
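The fragmentation behavior of the sequenced difference can be sketched as follows (set semantics only; the tuple layout `(value, start, end)` with half-open periods and the function names are invented for the example; the paper additionally handles the duplicate-preserving EXCEPT ALL case). Each left tuple keeps only the time during which no right tuple with the same value holds, splitting into fragments wherever overlaps occur:

```python
def subtract_period(period, holes):
    """Remove [s, e) holes from one [start, end) period; return fragments."""
    fragments = [period]
    for hs, he in sorted(holes):
        nxt = []
        for s, e in fragments:
            if he <= s or hs >= e:      # no overlap: keep whole fragment
                nxt.append((s, e))
                continue
            if s < hs:
                nxt.append((s, hs))     # remainder before the hole
            if he < e:
                nxt.append((he, e))     # remainder after the hole
        fragments = nxt
    return fragments

def sequenced_difference(left, right):
    out = []
    for v, s, e in left:
        holes = [(rs, re) for rv, rs, re in right if rv == v]
        out.extend((v, fs, fe) for fs, fe in subtract_period((s, e), holes))
    return out

L = [("x", 0, 10)]
R = [("x", 3, 5), ("x", 7, 8)]
print(sequenced_difference(L, R))
# → [('x', 0, 3), ('x', 5, 7), ('x', 8, 10)]
```

The one left period fragments into three, which is exactly the memory complication the paper's algorithms address by ordering the computation so fragments need not all be held in main memory.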
Citations: 14
StreamCorder: fast trial-and-error analysis in scientific databases
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994769
E. Stolte, G. Alonso
We have implemented a client/server system for fast trial-and-error analysis: the StreamCorder. The server streams wavelet-encoded views to the clients, where they are cached, decoded and processed. Low-quality decoding is beneficial for slow network connections. Low-resolution decoding greatly accelerates decoding and analysis. Depending on the system resources, cached data and analysis requirements, the user may alter the minimum analysis quality at any time.
Citations: 0
Managing complex and varied data with the IndexFabric(TM)
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994765
N. Sample, Brian F. Cooper, M. Franklin, Gísli R. Hjaltason, Moshe Shadmon, Levy Cohe
Emerging networked applications present significant challenges for traditional data management techniques for two reasons. First, they are based on data encoded in XML, LDAP directories, etc., which typically have complex inter-relationships. Second, the dynamic nature of networked applications and the need to integrate data from multiple sources result in data that is semi- or irregularly structured. The IndexFabric has been developed to meet both these challenges. In this demonstration, we show how the IndexFabric efficiently encodes and indexes very large collections of irregular, semistructured, and complex data.
Citations: 6
Efficient indexing structures for mining frequent patterns
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994758
Bin Lan, B. Ooi, K. Tan
In this paper, we propose a variant of the signature file, called the bit-sliced bloom-filtered signature file (BBS), as the basis for implementing filter-and-refine strategies for mining frequent patterns. In the filtering step, candidate patterns are obtained by scanning BBS instead of the database. The resultant candidate set is a superset of the frequent patterns. In the refinement phase, each algorithm refines the candidate set to prune away the false drops. Based on this indexing structure, we study two filtering (single and dual filter) and two refinement (sequential scan and probe) mechanisms, giving rise to four different strategies. We conducted an extensive performance study to evaluate the effectiveness of BBS, comparing the four proposed processing schemes with the traditional Apriori algorithm and the recently proposed FP-tree scheme. Our results show that BBS, as a whole, outperforms the Apriori strategy. Moreover, the scheme based on dual filtering and probe refinement performs the best in all cases.
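The filter-and-refine idea can be sketched with a plain Bloom filter standing in for the paper's bit-sliced signature file (the actual BBS layout is more elaborate; class and function names here are invented). The filter step cheaply rules transactions in or out, possibly admitting false positives; the refine step removes the false drops with an exact containment check, so the final support count is exact:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # No false negatives; false positives are possible.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

transactions = [{"milk", "bread"}, {"milk", "eggs"}, {"bread"}]
filters = []
for t in transactions:
    bf = BloomFilter()
    for item in t:
        bf.add(item)
    filters.append(bf)

def support(itemset):
    # Filter: cheap Bloom tests.  Refine: exact containment check.
    candidates = [t for t, bf in zip(transactions, filters)
                  if all(bf.might_contain(i) for i in itemset)]
    return sum(itemset <= t for t in candidates)
```

The payoff in the paper's setting is that the filtering step scans the compact signature structure rather than the full transaction database.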
Citations: 16
An efficient index structure for shift and scale invariant search of multi-attribute time sequences
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994720
Tamer Kahveci, Ambuj K. Singh, Aliekber Gürel
We consider the problem of shift and scale invariant search for multi-attribute time sequences. Our work fills a void in the existing literature on time sequence similarity, since existing techniques do not consider the general symmetric formulation of the problem. We define a new, symmetric distance function for multi-attribute time sequences: the distance between two time sequences is the smallest Euclidean distance obtained after scaling and shifting either one of the sequences to bring it as close as possible to the other. We define two models for comparing multi-attribute time sequences: in the first model, the scaling and shifting of the component sequences are dependent, and in the second model they are independent. We propose a novel index structure called CS-Index (cone slice) for shift and scale invariant comparison of time sequences.
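For a single attribute, the optimal shift and scale have a least-squares closed form, which this pure-Python sketch computes in both directions and takes the minimum of, matching the symmetric definition above. This is only a worked illustration of the distance itself, not the paper's CS-Index; the function names are invented:

```python
import math

def fitted_distance(a, b):
    """min over s, t of ||a - (s*b + t)||, via the least-squares closed form:
    s = cov(b, a) / var(b), t = mean(a) - s * mean(b)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    var_b = sum((x - mb) ** 2 for x in b)
    cov = sum((x - mb) * (y - ma) for x, y in zip(b, a))
    s = cov / var_b if var_b else 0.0
    t = ma - s * mb
    return math.sqrt(sum((y - (s * x + t)) ** 2 for x, y in zip(b, a)))

def invariant_distance(a, b):
    """Symmetric version: best fit in either direction, take the smaller."""
    return min(fitted_distance(a, b), fitted_distance(b, a))

# b is an exact scale-and-shift of a (b = 2*a + 1), so the distance is zero.
a = [1.0, 2.0, 3.0, 4.0]
b = [3.0, 5.0, 7.0, 9.0]
```

The multi-attribute models in the paper then differ in whether the per-attribute scale/shift pairs are constrained to agree (dependent) or fitted separately (independent).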
Citations: 6
Detecting changes in XML documents
Pub Date : 2002-02-26 DOI: 10.1109/ICDE.2002.994696
G. Cobena, S. Abiteboul, A. Marian
We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project, which is investigating dynamic warehouses capable of storing massive volumes of XML data. Because of this context, our algorithm has to be very efficient in terms of speed and memory space, even at the cost of some loss of quality. It also considers, besides the insertions, deletions and updates standard in diffs, a move operation on subtrees that is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings. It also uses XML-specific information such as ID attributes. We provide a performance analysis of the algorithm and show that it runs on average in linear time, versus quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since the problem is NP-hard, linear time is obtained by trading away some quality. We present experiments (again on synthetic data) showing that the output of our algorithm is reasonably close to optimal in terms of quality. Finally, we present experiments on a small sample of XML pages found on the Web.
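The subtree-signature idea can be sketched by hashing subtrees bottom-up: identical hashes in the old and new trees identify unchanged subtrees (including moved ones) that the diff can match without computing any edits for them. The `(tag, children)` tuple encoding is an invented toy, not Xyleme's document format:

```python
import hashlib

def signature(node):
    """node = (tag, [children]); returns a content hash of the subtree."""
    tag, children = node
    payload = tag + "(" + ",".join(signature(c) for c in children) + ")"
    return hashlib.sha256(payload.encode()).hexdigest()

def subtree_signatures(node, out=None):
    """Map signature -> subtrees, for every subtree of the given tree."""
    out = {} if out is None else out
    out.setdefault(signature(node), []).append(node)
    for c in node[1]:
        subtree_signatures(c, out)
    return out

old = ("doc", [("a", []), ("b", [("c", [])])])
new = ("doc", [("b", [("c", [])]), ("x", [])])  # "b" subtree moved, unchanged

matched = set(subtree_signatures(old)) & set(subtree_signatures(new))
# The unchanged "b" subtree and its "c" child share signatures across
# versions, so they can be matched (as a move) rather than diffed.
```

Recomputing the hash at every node makes this toy quadratic; a real implementation would compute signatures once, bottom-up, to stay near linear.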
Citations: 533
Journal: Proceedings 18th International Conference on Data Engineering