Design and evaluation of alternative selection placement strategies in optimizing continuous queries
Jianjun Chen, D. DeWitt, J. Naughton
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994749

We design and evaluate alternative selection placement strategies for optimizing a very large number of continuous queries in an Internet environment. Two grouping strategies, PushDown and PullUp, in which selections are either pushed below or pulled above joins, are proposed and investigated. While our earlier research demonstrated that incremental group optimization can significantly outperform an ungrouped approach, the results in this paper show that different incremental group optimization strategies can have significantly different performance characteristics. Surprisingly, in our studies, PullUp, in which selections are pulled above joins, is often better and achieves an average 10-fold performance improvement over PushDown (occasionally 100 times faster). Furthermore, a revised version of PullUp, termed filtered PullUp, is proposed that further reduces the cost of PullUp by 75% when the union of the selection predicates is selective. Detailed cost models are presented that consider several key parameters, including (1) the characteristics of the queries to be grouped and (2) the characteristics of the data changes. Preliminary experiments using an implementation of both strategies show that our models are fairly accurate in predicting the results obtained from the implementation of these techniques in the Niagara system.
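The PushDown/PullUp trade-off described above can be sketched in a toy model (this is an assumed illustration, not the paper's implementation): under PushDown each query's selection filters the input before its own join, so the join runs once per query; under PullUp the join is computed once and shared, with each query's selection applied to the join output.

```python
# Toy sketch (assumed, not the paper's code) contrasting the two grouping
# strategies for N continuous queries of the form sigma_i(R join S).
def nl_join(R, S):
    """Simple nested-loop equi-join on the first field of each tuple."""
    return [(r, s) for r in R for s in S if r[0] == s[0]]

def push_down(queries, R, S):
    # PushDown: each query's selection filters R first, so the join
    # is evaluated once per query.
    return {q: nl_join([r for r in R if pred(r)], S)
            for q, pred in queries.items()}

def pull_up(queries, R, S):
    # PullUp: the join is shared across all queries; selections are
    # applied to the single shared join result.
    joined = nl_join(R, S)
    return {q: [(r, s) for (r, s) in joined if pred(r)]
            for q, pred in queries.items()}

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(1, 'x'), (2, 'y')]
queries = {'q1': lambda r: r[0] < 3, 'q2': lambda r: r[1] == 'b'}
assert push_down(queries, R, S) == pull_up(queries, R, S)
```

The two plans return identical answers; which is cheaper depends on how selective the predicates are and how expensive the shared join is, which is exactly the trade-off the paper's cost models capture.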
Efficient evaluation of queries with mining predicates
S. Chaudhuri, Vivek R. Narasayya, Sunita Sarawagi
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-06-03 | DOI: 10.1109/ICDE.2002.994772

Modern relational database systems are beginning to support ad hoc queries on data mining models. In this paper, we explore novel techniques for optimizing queries that apply mining models to relational data. For such queries, we use the internal structure of the mining model to automatically derive traditional database predicates. We present algorithms for deriving such predicates for some popular discrete mining models: decision trees, naive Bayes, and clustering. Our experiments on Microsoft SQL Server 2000 demonstrate that these derived predicates can significantly reduce the cost of evaluating such queries.
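For the decision-tree case, the derivation idea can be illustrated as follows (a hypothetical sketch with an invented tree encoding, not the paper's algorithm): the disjunction of root-to-leaf path conditions for leaves predicting the class of interest is an ordinary database predicate that covers every row the model could assign to that class.

```python
# Hypothetical sketch: derive a SQL-style predicate from a decision tree.
# A tree node is either ('leaf', label) or (attr, threshold, left, right),
# where left holds attr <= threshold and right holds attr > threshold.
def tree_to_paths(node, target, path=()):
    """Collect the path conditions of all leaves predicting `target`."""
    if node[0] == 'leaf':
        return [list(path)] if node[1] == target else []
    attr, thr, left, right = node
    return (tree_to_paths(left, target, path + ((attr, '<=', thr),)) +
            tree_to_paths(right, target, path + ((attr, '>', thr),)))

def to_sql(paths):
    """Render the disjunction of conjunctive path conditions."""
    return ' OR '.join(
        '(' + ' AND '.join(f'{a} {op} {v}' for a, op, v in p) + ')'
        for p in paths)

tree = ('age', 30,
        ('leaf', 'yes'),
        ('income', 50, ('leaf', 'no'), ('leaf', 'yes')))
print(to_sql(tree_to_paths(tree, 'yes')))
# -> (age <= 30) OR (age > 30 AND income > 50)
```

Pushing such a predicate into the relational engine lets indexes prune rows before the (comparatively expensive) model is ever invoked.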
Similarity flooding: a versatile graph matching algorithm and its application to schema matching
S. Melnik, H. Garcia-Molina, E. Rahm
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994702

Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, and even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and, if necessary, adjust the results. Accordingly, we evaluate the 'accuracy' of the algorithm by counting the number of needed adjustments. We conducted a user study in which our accuracy metric was used to estimate the labor savings that users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.
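The fixpoint computation can be sketched roughly as follows (an assumed, simplified formulation; the paper's propagation graph and weighting are more elaborate): similarities of node pairs repeatedly absorb the similarities of neighboring pairs, with normalization, until the values stabilize.

```python
# Rough sketch (assumed formulation) of fixpoint similarity propagation.
def flood(pairs, neighbors, init, iters=50):
    """pairs: list of (a, b) node pairs across the two graphs.
    neighbors: pair -> list of pairs that propagate similarity into it.
    init: pair -> initial (e.g. string-based) similarity."""
    sim = {p: init(p) for p in pairs}
    for _ in range(iters):
        # Each pair's next value is its seed plus its neighbors' support.
        nxt = {p: init(p) + sum(sim[q] for q in neighbors(p))
               for p in pairs}
        norm = max(nxt.values()) or 1.0   # normalize to keep values bounded
        sim = {p: v / norm for p, v in nxt.items()}
    return sim

pairs = [('a', 'x'), ('b', 'y')]
nbrs = lambda p: [q for q in pairs if q != p]     # tiny propagation graph
init = lambda p: 1.0 if p == ('a', 'x') else 0.1  # seed similarities
sim = flood(pairs, nbrs, init)
```

A filter (e.g. keeping, per node, the highest-scoring counterpart) is then applied to the converged similarities to produce the candidate mapping a human reviews.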
Evaluating top-k queries over Web-accessible databases
Nicolas Bruno, L. Gravano, A. Marian
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994751

A query to a Web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" k pages for the query. This top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify their location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing such top-k queries efficiently is challenging for a number of reasons. One critical reason is that, in many Web applications, the relation attributes might not be available other than through external Web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real Web-accessible data.
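The restaurant example above can be made concrete with a minimal illustration of the query model itself (this naive in-memory version is not one of the paper's algorithms, whose whole point is to avoid scoring every candidate via expensive remote probes): each object gets a monotone combination of per-attribute closeness scores, and the k best are returned.

```python
# Minimal illustration (assumed scoring weights) of the top-k query model.
import heapq

def closeness(value, target, scale):
    """1.0 at an exact match, decaying linearly to 0.0 at `scale` away."""
    return max(0.0, 1.0 - abs(value - target) / scale)

def top_k(restaurants, user_loc, target_price, k=2):
    def score(r):
        return (closeness(r['loc'], user_loc, 10)       # proximity
                + closeness(r['price'], target_price, 20)  # price match
                + r['rating'] / 5.0)                    # food rating
    return heapq.nlargest(k, restaurants, key=score)

restaurants = [
    {'name': 'near', 'loc': 1,  'price': 20, 'rating': 4},
    {'name': 'far',  'loc': 50, 'price': 20, 'rating': 5},
    {'name': 'mid',  'loc': 5,  'price': 25, 'rating': 3},
]
best = top_k(restaurants, user_loc=0, target_price=20)
```

When each attribute lives behind a separate Web form, computing `score` for every object would mean one remote probe per attribute per object; the paper's algorithms order and prune those probes.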
Sequenced subset operators: definition and implementation
Joseph Dunn, S. Davey, A. Descour, R. Snodgrass
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994699

Difference, intersection, semijoin, and anti-semijoin may be considered binary subset operators, in that they all return a subset of their left-hand argument. These operators are useful for implementing SQL's EXCEPT, INTERSECT, NOT IN and NOT EXISTS, distributed queries, and referential integrity. Difference-all and intersection-all operate on multisets and track the number of duplicates in both argument relations; they are used to implement SQL's EXCEPT ALL and INTERSECT ALL. Their temporally sequenced analogues, which effectively apply the subset operator at each point in time, are needed for implementing these constructs in temporal databases. These SQL expressions are complex; most necessitate at least a three-way join, with nested NOT EXISTS clauses. We consider how to implement these operators directly in a DBMS. These operators are interesting in that they can fragment the left-hand validity periods (sequenced difference-all also fragments the right-hand periods) and thus introduce memory complications found neither in their non-temporal counterparts nor in temporal joins and semijoins. We introduce novel algorithms for implementing these operators by ordering the computation so that fragments need not be retained in main memory. We evaluate these algorithms and demonstrate that they are no more expensive than a single conventional join.
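The period fragmentation the abstract mentions can be illustrated with a toy sequenced-difference step (an assumed simplification using half-open integer periods; the paper's algorithms are about ordering this work so fragments need not stay in memory):

```python
# Toy sketch (assumed semantics): a left tuple's validity period is
# fragmented by removing the periods during which a matching right
# tuple also holds, as in a sequenced EXCEPT.
def subtract(period, cuts):
    """Remove each half-open cut interval from `period`, fragmenting it."""
    frags = [period]
    for cs, ce in cuts:
        nxt = []
        for s, e in frags:
            if ce <= s or cs >= e:   # no overlap: fragment survives whole
                nxt.append((s, e))
                continue
            if s < cs:               # left remainder before the cut
                nxt.append((s, cs))
            if ce < e:               # right remainder after the cut
                nxt.append((ce, e))
        frags = nxt
    return frags

# Left tuple valid on [0, 10); matching right tuples on [2, 4) and [6, 7):
assert subtract((0, 10), [(2, 4), (6, 7)]) == [(0, 2), (4, 6), (7, 10)]
```

One left period can thus explode into many fragments, which is exactly the memory complication absent from non-temporal difference and from temporal joins.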
StreamCorder: fast trial-and-error analysis in scientific databases
E. Stolte, G. Alonso
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994769

We have implemented a client/server system for fast trial-and-error analysis: the StreamCorder. The server streams wavelet-encoded views to the clients, where they are cached, decoded, and processed. Low-quality decoding is beneficial for slow network connections. Low-resolution decoding greatly accelerates decoding and analysis. Depending on system resources, cached data, and analysis requirements, the user may alter the minimum analysis quality at any time.
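Why wavelet encoding permits cheap low-resolution decoding can be seen in a one-level 1-D Haar transform (a generic textbook sketch, not StreamCorder's codec): the encoding stores pairwise averages plus details, and dropping the detail coefficients yields a half-resolution view with no extra work.

```python
# Generic sketch: one level of the 1-D Haar wavelet transform.
def haar_level(x):
    """Split a signal of even length into averages and details."""
    avg = [(a + b) / 2 for a, b in zip(x[::2], x[1::2])]
    det = [(a - b) / 2 for a, b in zip(x[::2], x[1::2])]
    return avg, det

def reconstruct(avg, det):
    """Invert one Haar level exactly."""
    out = []
    for a, d in zip(avg, det):
        out += [a + d, a - d]
    return out

signal = [4, 6, 10, 12]
avg, det = haar_level(signal)
assert reconstruct(avg, det) == signal   # full-quality decode
low_res = avg                            # half-resolution decode: skip `det`
```

Cascading such levels gives the client a dial: decode only the coarse coefficients for a fast preview, or pull and apply more detail levels as quality requirements rise.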
Managing complex and varied data with the IndexFabric™
N. Sample, Brian F. Cooper, M. Franklin, Gísli R. Hjaltason, Moshe Shadmon, Levy Cohe
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994765

Emerging networked applications present significant challenges for traditional data management techniques for two reasons. First, they are based on data encoded in XML, LDAP directories, etc. that typically have complex inter-relationships. Second, the dynamic nature of networked applications and the need to integrate data from multiple sources results in data that is semi- or irregularly structured. The IndexFabric has been developed to meet both these challenges. In this demonstration, we show how the IndexFabric efficiently encodes and indexes very large collections of irregular, semistructured, and complex data.
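One way to index irregular hierarchical data, sketched below with a sorted-list stand-in (an assumption for illustration; the IndexFabric itself uses a layered trie structure rather than this dict-and-bisect toy), is to flatten root-to-value paths into key strings so that prefix search answers path queries regardless of per-record structure.

```python
# Hedged sketch of the key-encoding idea: flatten each record's
# root-to-value paths into strings, index them sorted, and answer
# path queries by prefix scan.
import bisect

def encode(record, prefix=''):
    """Flatten a nested dict into '/path/to/field=value' keys."""
    keys = []
    for field, value in record.items():
        path = f'{prefix}/{field}'
        if isinstance(value, dict):
            keys += encode(value, path)
        else:
            keys.append(f'{path}={value}')
    return keys

docs = [{'book': {'title': 'XML', 'author': {'name': 'Ann'}}},
        {'book': {'title': 'LDAP'}}]           # irregular: no author field
keys = sorted(k for d in docs for k in encode(d))

# Prefix scan for all /book/title values:
i = bisect.bisect_left(keys, '/book/title=')
matches = [k for k in keys[i:] if k.startswith('/book/title=')]
```

Because records that lack a field simply contribute no key for it, the same index serves both documents despite their differing shapes.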
Efficient indexing structures for mining frequent patterns
Bin Lan, B. Ooi, K. Tan
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994758

In this paper, we propose a variant of the signature file, called the bit-sliced bloom-filtered signature file (BBS), as the basis for implementing filter-and-refine strategies for mining frequent patterns. In the filtering step, the candidate patterns are obtained by scanning BBS instead of the database. The resultant candidate set contains a superset of the frequent patterns. In the refinement phase, each algorithm refines the candidate set to prune away the false drops. Based on this indexing structure, we study two filtering (single and dual filter) and two refinement (sequential scan and probe) mechanisms, giving rise to four different strategies. We conducted an extensive performance study of the effectiveness of BBS, and compared the four proposed processing schemes with the traditional Apriori algorithm and the recently proposed FP-tree scheme. Our results show that BBS, as a whole, outperforms the Apriori strategy. Moreover, the scheme based on dual filter and probe refinement performs the best in all cases.
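The filter-and-refine principle behind a Bloom-filtered signature can be illustrated as follows (an assumed, much-simplified sketch without the bit-slicing): each transaction is summarized as a small bit signature; a candidate itemset survives the filter only if all of its bits are set, which may admit false drops but never misses a true occurrence.

```python
# Simplified sketch (assumed) of Bloom-filtered signature filtering.
NBITS = 16  # signature width; small on purpose, to show false drops exist

def signature(items):
    """OR together one hashed bit per item."""
    sig = 0
    for it in items:
        sig |= 1 << (hash(it) % NBITS)
    return sig

def may_contain(tx_sig, itemset):
    """Superset test on bits: conservative, no false negatives."""
    q = signature(itemset)
    return tx_sig & q == q

def support(db_sigs, db, itemset):
    # Filter on signatures, then refine against the actual transactions
    # to discard any false drops the filter let through.
    cand = [i for i, s in enumerate(db_sigs) if may_contain(s, itemset)]
    return sum(1 for i in cand if set(itemset) <= set(db[i]))

db = [['a', 'b', 'c'], ['b', 'c'], ['a', 'd']]
sigs = [signature(t) for t in db]
assert support(sigs, db, ['b', 'c']) == 2
```

The filter step touches only the compact signatures; the database itself is read only for the (hopefully few) surviving candidates, which is where BBS gets its savings.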
An efficient index structure for shift and scale invariant search of multi-attribute time sequences
Tamer Kahveci, Ambuj K. Singh, Aliekber Gürel
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994720

We consider the problem of shift and scale invariant search for multi-attribute time sequences. Our work fills a void in the existing literature on time sequence similarity, since existing techniques do not consider the general symmetric formulation of the problem. We define a new distance function for multi-attribute time sequences that is symmetric: the distance between two time sequences is defined to be the smallest Euclidean distance after scaling and shifting either one of the sequences to be as close as possible to the other. We define two models for comparing multi-attribute time sequences: in the first model, the scaling and shifting of the component sequences are dependent, and in the second model they are independent. We propose a novel index structure called CS-Index (cone slice) for shift and scale invariant comparison of time sequences.
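The one-directional half of such a distance has a standard closed form (shown below as an assumed least-squares sketch, not necessarily the paper's exact definition; the symmetric version would take the smaller of the two directional fits):

```python
# Sketch: fit the scale a and shift b minimizing ||a*x + b - y||,
# then report the residual Euclidean distance.
import math

def invariant_dist(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx if sxx else 0.0   # least-squares scale
    b = my - a * mx                 # least-squares shift
    return math.sqrt(sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y)))

# A sequence is at distance zero from any scaled and shifted copy of itself:
assert invariant_dist([1, 2, 3, 4], [10, 20, 30, 40]) < 1e-9
```

Indexing under this distance is harder than under plain Euclidean distance because every query effectively matches an entire family of scaled/shifted sequences, which is the geometry the cone-slice structure is built to bound.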
Detecting changes in XML documents
G. Cobena, S. Abiteboul, A. Marian
Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-02-26 | DOI: 10.1109/ICDE.2002.994696

We present a diff algorithm for XML data. This work is motivated by the support for change control in the context of the Xyleme project, which is investigating dynamic warehouses capable of storing massive volumes of XML data. Because of this context, our algorithm has to be very efficient in terms of speed and memory space, even at the cost of some loss of quality. Also, it considers, besides insertions, deletions, and updates (standard in diffs), a move operation on subtrees that is essential in the context of XML. Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions. Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings. It also uses XML-specific information such as ID attributes. We provide a performance analysis of the algorithm. We show that it runs on average in linear time, versus quadratic time for previous algorithms. We present experiments on synthetic data that confirm the analysis. Since this problem is NP-hard, the linear time is obtained by trading some quality. We present experiments (again on synthetic data) that show that the output of our algorithm is reasonably close to the optimal in terms of quality. Finally, we present experiments on a small sample of XML pages found on the Web.
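The signature-based subtree matching can be sketched as follows (an assumed toy representation, not the paper's exact signatures): each node's hash combines its label with its children's hashes, so identical subtrees in the old and new versions hash alike and can be matched, and then reported as unchanged or moved, without element-by-element comparison.

```python
# Toy sketch (assumed tree encoding): structural signatures for matching
# unchanged subtrees between two versions of an XML document.
def sig(node):
    """node: (label, [children]) -> structural hash of the whole subtree."""
    label, children = node
    return hash((label, tuple(sig(c) for c in children)))

def index(node, table=None):
    """Map each subtree signature to the nodes bearing it."""
    if table is None:
        table = {}
    table.setdefault(sig(node), []).append(node)
    for c in node[1]:
        index(c, table)
    return table

old = ('doc', [('a', []), ('b', [('c', [])])])
new = ('doc', [('b', [('c', [])]), ('d', [])])   # 'b' subtree moved, kept intact
old_sigs = index(old)

# The unchanged 'b' subtree is found in the old version by signature alone:
assert sig(('b', [('c', [])])) in old_sigs
```

Matching large subtrees first and then propagating matches to ancestors and descendants is what keeps the expected running time linear, at the cost of occasionally missing the optimal edit script.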