Although wrapper generation work has been reported in the literature, there seem to be no standard ways to evaluate the performance of such systems. We conducted a series of experiments to evaluate the usability, correctness and efficiency of SG-WRAP. The usability tests asked a number of users to generate wrappers with the system. The results indicated that, with only a minimal introduction to the system, to DTD definitions and to the structure of HTML pages, even naive users could quickly generate wrappers without much difficulty. For correctness, we adapted the precision and recall metrics from information retrieval to data extraction. The results show that, with the refining process, the system can generate wrappers with very high accuracy. Finally, the efficiency tests indicated that the wrapper generation process is fast enough even for large Web pages.
{"title":"SG-WRAP: a schema-guided wrapper generator","authors":"Xiaofeng Meng, Hongjun Lu, Haiyan Wang, Mingzhe Gu","doi":"10.1109/ICDE.2002.994743","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994743","url":null,"abstract":"Although wrapper generation work has been reported in the literature, there seem no standard ways to evaluate the performance of such systems. We conducted a series of experiments to evaluate the usability, correctness and efficiency of SG-WRAP. The usability tests selected a number of users to use the system. The results indicated that, with minimal introduction of the system, DTD definition and structure of HTML pages, even naive users could quickly generate wrappers without much difficulty. For correctness, we adapted the precision and recall metrics in information retrieval to data extraction. The results show that, with the refining process, the system can generate wrappers with very high accuracy. Finally, the efficiency tests indicated that the wrapper generation process is fast enough even with large size Web pages.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117342658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Declarative composition and peer-to-peer provisioning of dynamic Web services
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994738
B. Benatallah, Quan Z. Sheng, A. Ngu, M. Dumas
The development of new services through the integration of existing ones has gained considerable momentum as a means to create and streamline business-to-business collaborations. Unfortunately, as Web services are often autonomous and heterogeneous entities, connecting and coordinating them in order to build integrated services is a delicate and time-consuming task. In this paper, we describe the design and implementation of a system through which existing Web services can be declaratively composed, and the resulting composite services can be executed following a peer-to-peer paradigm, within a dynamic environment. This system provides tools for specifying composite services through statecharts, data conversion rules, and provider-selection policies. These specifications are then translated into XML documents that can be interpreted by peer-to-peer interconnected software components, in order to provision the composite service without requiring a central authority.
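The abstract does not show what the generated XML looks like; the sketch below is a purely hypothetical illustration of serializing a declarative statechart description of a composite service into an XML document. All element and attribute names are invented, not the system's actual format.

```python
# Hypothetical sketch: serialize a composite-service statechart to XML.
# Element and attribute names are invented for illustration only.
import xml.etree.ElementTree as ET

statechart = {
    "name": "TravelBooking",
    "states": [
        {"id": "FlightBooking", "service": "AirlineWS"},
        {"id": "HotelBooking", "service": "HotelWS"},
    ],
    "transitions": [
        {"from": "FlightBooking", "to": "HotelBooking", "condition": "flightConfirmed"},
    ],
}

root = ET.Element("compositeService", name=statechart["name"])
for s in statechart["states"]:
    ET.SubElement(root, "state", id=s["id"], service=s["service"])
for t in statechart["transitions"]:
    ET.SubElement(root, "transition",
                  attrib={"from": t["from"], "to": t["to"], "condition": t["condition"]})

# The resulting document could be shipped to and interpreted by the peers
# hosting the component services.
print(ET.tostring(root, encoding="unicode"))
```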
{"title":"Declarative composition and peer-to-peer provisioning of dynamic Web services","authors":"B. Benatallah, Quan Z. Sheng, A. Ngu, M. Dumas","doi":"10.1109/ICDE.2002.994738","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994738","url":null,"abstract":"The development of new services through the integration of existing ones has gained a considerable momentum as a means to create and streamline business-to-business collaborations. Unfortunately, as Web services are often autonomous and heterogeneous entities, connecting and coordinating them in order to build integrated services is a delicate and time-consuming task. In this paper, we describe the design and implementation of a system through which existing Web services can be declaratively composed, and the resulting composite services can be executed following a peer-to-peer paradigm, within a dynamic environment. This system provides tools for specifying composite services through. statecharts, data conversion rules, and provider selection, policies. These specifications are then translated into XML documents that can be interpreted by peer-to-peer inter-connected software components, in order to provision the composite service without requiring a central authority.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114260338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query estimation by adaptive sampling
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994781
Yi-Leh Wu, D. Agrawal, A. E. Abbadi
The ability to provide accurate and efficient estimates of user query results is very important for the query optimizer in database systems. In this paper, we show that traditional estimation techniques, which take a data-reduction point of view, do not produce satisfactory estimates when the query patterns change dynamically. We further show that, to reduce query estimation error, it is more effective to capture the user query patterns than to accurately capture the data distribution. We propose query estimation techniques that adapt to user query patterns to give more accurate estimates of the size of selection or range queries over databases.
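A minimal sketch of the general idea of letting query feedback, rather than a static summary of the data, drive selectivity estimates. This is a generic illustration; the bucket layout and update rule are assumptions, not the authors' algorithm.

```python
# Generic sketch: refine range-selectivity estimates from query feedback
# instead of a static data summary. The update rule is an assumption.

class FeedbackEstimator:
    def __init__(self, lo, hi, buckets=10):
        self.edges = [lo + i * (hi - lo) / buckets for i in range(buckets + 1)]
        self.density = [1.0 / buckets] * buckets   # uniform prior

    def estimate(self, a, b):
        """Estimated fraction of tuples with value in [a, b]."""
        total = 0.0
        for i in range(len(self.density)):
            left, right = self.edges[i], self.edges[i + 1]
            overlap = max(0.0, min(b, right) - max(a, left))
            if right > left:
                total += self.density[i] * overlap / (right - left)
        return total

    def feedback(self, a, b, actual_fraction, rate=0.5):
        """After executing the query, nudge the touched buckets toward the truth."""
        est = self.estimate(a, b)
        for i in range(len(self.density)):
            left, right = self.edges[i], self.edges[i + 1]
            if min(b, right) > max(a, left):        # bucket overlaps the query
                self.density[i] += rate * (actual_fraction - est) / len(self.density)

est = FeedbackEstimator(0, 100)
print(est.estimate(0, 50))      # 0.5 under the uniform prior
est.feedback(0, 50, actual_fraction=0.8)
print(est.estimate(0, 50))      # moves toward 0.8 as feedback accumulates
```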
{"title":"Query estimation by adaptive sampling","authors":"Yi-Leh Wu, D. Agrawal, A. E. Abbadi","doi":"10.1109/ICDE.2002.994781","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994781","url":null,"abstract":"The ability to provide accurate and efficient result estimations of user queries is very important for the query optimizer in database systems. In this paper, we show that the traditional estimation techniques with data reduction points of view do not produce satisfiable estimation results if the query patterns are dynamically changing. We further show that to reduce query estimation error, instead of accurately capturing the data distribution, it is more effective to capture the user query patterns. In this paper, we propose query estimation techniques that can adapt to user query patterns for more accurate estimates of the size of selection or range queries over databases.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128229702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From XML schema to relations: a cost-based approach to XML storage
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994698
P. Bohannon, J. Freire, Prasan Roy, Jérôme Siméon
As Web applications manipulate an increasing amount of XML, there is a growing interest in storing XML data in relational databases. Due to the mismatch between the complexity of XML's tree structure and the simplicity of flat relational tables, there are many ways to store the same document in an RDBMS, and a number of heuristic techniques have been proposed. These techniques typically define fixed mappings and do not take application characteristics into account. However, a fixed mapping is unlikely to work well for all possible applications. In contrast, LegoDB is a cost-based XML storage mapping engine that explores a space of possible XML-to-relational mappings and selects the best mapping for a given application. LegoDB leverages current XML and relational technologies: (1) it models the target application with an XML Schema, XML data statistics, and an XQuery workload; (2) the space of configurations is generated through XML-Schema rewritings; and (3) the best among the derived configurations is selected using cost estimates obtained through a standard relational optimizer. We describe the LegoDB storage engine and provide experimental results that demonstrate the effectiveness of this approach.
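The core loop of a cost-based mapping search can be sketched as follows. Candidate generation and costing are stubbed out with invented placeholders; LegoDB's actual schema rewritings and its use of a relational optimizer are far richer.

```python
# Minimal sketch of cost-based selection among candidate XML-to-relational
# mappings. Candidates and costs below are invented placeholders.

def enumerate_candidates(xml_schema):
    # Placeholder: LegoDB derives candidates by rewriting the XML Schema
    # (e.g. inlining or outlining element definitions).
    return [
        {"name": "fully-inlined", "tables": ["book"]},
        {"name": "outlined-authors", "tables": ["book", "author"]},
    ]

def estimated_cost(candidate, workload, statistics):
    # Placeholder: LegoDB obtains this from a relational optimizer's cost
    # estimate for the translated workload over the candidate schema.
    fake_costs = {"fully-inlined": 120.0, "outlined-authors": 85.0}
    return fake_costs[candidate["name"]]

def pick_best_mapping(xml_schema, workload, statistics):
    candidates = enumerate_candidates(xml_schema)
    return min(candidates, key=lambda c: estimated_cost(c, workload, statistics))

print(pick_best_mapping(xml_schema=None, workload=[], statistics={}))
```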
{"title":"From XML schema to relations: a cost-based approach to XML storage","authors":"P. Bohannon, J. Freire, Prasan Roy, Jérôme Siméon","doi":"10.1109/ICDE.2002.994698","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994698","url":null,"abstract":"As Web applications manipulate an increasing amount of XML, there is a growing interest in storing XML data in relational databases. Due to the mismatch between the complexity of XML's tree structure and the simplicity of flat relational tables, there are many ways to store the same document in an RDBMS, and a number of heuristic techniques have been proposed. These techniques typically define fixed mappings and do not take application characteristics into account. However, a fixed mapping is unlikely to work well for all possible applications. In contrast, LegoDB is a cost-based XML storage mapping engine that explores a space of possible XML-to-relational mappings and selects the best mapping for a given application. LegoDB leverages current XML and relational technologies: (1) it models the target application with an XML Schema, XML data statistics, and an XQuery workload; (2) the space of configurations is generated through XML-Schema rewritings; and (3) the best among the derived configurations is selected using cost estimates obtained through a standard relational optimizer. We describe the LegoDB storage engine and provide experimental results that demonstrate the effectiveness of this approach.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128466405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cost models for overlapping and multi-version B-trees
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994709
Yufei Tao, D. Papadias, Jun Zhang
Overlapping and multi-version techniques are two popular frameworks that transform an ephemeral index into a multiple logical-tree structure in order to support versioning databases. Although both frameworks have produced numerous efficient indexing methods, their performance analysis is rather limited; as a result, there is no clear understanding about the behavior of the alternative structures and the choice of the best one, given the data and query characteristics. Furthermore, query optimization based on these methods is currently impossible. These are serious problems due to the incorporation of overlapping and multi-version techniques in several traditional (e.g. banking) and emerging (e.g. spatio-temporal) applications. In this paper, we propose frameworks for reducing the performance analysis of overlapping and multi-version structures to that of the corresponding ephemeral structures, thus simplifying the problem significantly. The frameworks lead to accurate cost models that predict the sizes of the trees, the node accesses and query selectivity. Although we focus on B-tree-based structures, the proposed models can be employed with a variety of indexes.
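For a flavour of what such a cost model computes, the sketch below gives a textbook-style node-access estimate for a single ephemeral B-tree; the paper's contribution is reducing the overlapping and multi-version cases to estimates of this kind. The formula here is the standard generic one, not the paper's model.

```python
# Generic B-tree cost sketch (not the paper's model): estimate node accesses
# for a range query given entry count, average fanout, and query selectivity.
import math

def btree_cost(num_entries, fanout, selectivity):
    """Rough node-access estimate for a range query on a single B-tree."""
    leaves = math.ceil(num_entries / fanout)
    if leaves <= 1:
        height = 1
    else:
        height = 1 + math.ceil(math.log(leaves, fanout))
    leaf_accesses = max(1, math.ceil(selectivity * leaves))
    # One root-to-leaf descent (height - 1 internal nodes) plus a scan
    # across the qualifying leaves.
    return (height - 1) + leaf_accesses

print(btree_cost(num_entries=1_000_000, fanout=100, selectivity=0.001))
```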
{"title":"Cost models for overlapping and multi-version B-trees","authors":"Yufei Tao, D. Papadias, Jun Zhang","doi":"10.1109/ICDE.2002.994709","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994709","url":null,"abstract":"Overlapping and multi-version techniques are two popular frameworks that transform an ephemeral index into a multiple logical-tree structure in order to support versioning databases. Although both frameworks have produced numerous efficient indexing methods, their performance analysis is rather limited; as a result, there is no clear understanding about the behavior of the alternative structures and the choice of the best one, given the data and query characteristics. Furthermore, query optimization based on these methods is currently impossible. These are serious problems due to the incorporation of overlapping and multi-version techniques in several traditional (e.g. banking) and emerging (e.g. spatio-temporal) applications. In this paper, we propose frameworks for reducing the performance analysis of overlapping and multi-version structures to that of the corresponding ephemeral structures, thus simplifying the problem significantly. The frameworks lead to accurate cost models that predict the sizes of the trees, the node accesses and query selectivity. Although we focus on B-tree-based structures, the proposed models can be employed with a variety of indexes.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128723598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FAST: a new sampling-based algorithm for discovering association rules
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994717
Bin Chen, P. Haas, P. Scheuermann
We present FAST (finding associations from sampled transactions), a refined sampling-based mining algorithm that is distinguished from prior algorithms by its novel two-phase approach to sample collection. In phase I a large sample is collected to quickly and accurately estimate the support of each item in the database. In phase II, a small final sample is obtained by excluding "outlier" transactions in such a manner that the support of each item in the final sample is as close as possible to the estimated support of the item in the entire database. We propose two approaches to obtaining the final sample in phase II: trimming and growing. The trimming procedure starts from the large initial sample and removes outlier transactions until a specified stopping criterion is satisfied. In contrast, the growing procedure selects representative transactions from the initial sample and adds them to an initially empty data set.
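A heavily simplified sketch of the two-phase idea, using the trimming variant. The distance measure and the greedy removal rule below are assumptions made for illustration; the paper defines these precisely.

```python
# Simplified sketch of FAST-style two-phase sampling with trimming.
# The distance measure and greedy removal rule are assumptions.
from collections import Counter

def item_supports(transactions):
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def distance(supports_a, supports_b):
    items = set(supports_a) | set(supports_b)
    return sum(abs(supports_a.get(i, 0.0) - supports_b.get(i, 0.0)) for i in items)

def trim(initial_sample, target_size):
    # Phase I: estimate item supports from the large initial sample.
    estimated = item_supports(initial_sample)
    sample = list(initial_sample)
    # Phase II (trimming): greedily drop the transaction whose removal brings
    # the sample's supports closest to the estimated supports. This loop is
    # quadratic and meant only to illustrate the idea.
    while len(sample) > target_size:
        best_idx, best_dist = None, None
        for i in range(len(sample)):
            candidate = sample[:i] + sample[i + 1:]
            d = distance(item_supports(candidate), estimated)
            if best_dist is None or d < best_dist:
                best_idx, best_dist = i, d
        sample.pop(best_idx)
    return sample

initial = [["a", "b"], ["a"], ["a", "b", "c"], ["b"], ["a", "c"], ["z"]]
print(trim(initial, target_size=4))
```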
{"title":"FAST: a new sampling-based algorithm for discovering association rules","authors":"Bin Chen, P. Haas, P. Scheuermann","doi":"10.1109/ICDE.2002.994717","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994717","url":null,"abstract":"We present FAST (finding associations from sampled transactions), a refined sampling-based mining algorithm that is distinguished from prior algorithms by its novel two-phase approach to sample collection. In phase I a large sample is collected to quickly and accurately estimate the support of each item in the database. In phase II, a small final sample is obtained by excluding \"outlier\" transactions in such a manner that the support of each item in the final sample is as close as possible to the estimated support of the item in the entire database. We propose two approaches to obtaining the final sample in phase II: trimming and growing. The trimming procedure starts from the large initial sample and removes outlier transactions until a specified stopping criterion is satisfied. In contrast, the growing procedure selects representative transactions from the initial sample and adds them to an initially empty data set.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126670014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multivariate time series prediction via temporal classification
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994722
B. Liu, Jing Liu
In this paper, we study a special form of time-series prediction, viz. the prediction of a dependent variable that takes discrete values. Although in a real application this variable may take numeric values, users are usually interested only in its value ranges, e.g. normal or abnormal, not its actual values. In this work, we extend two traditional classification techniques, namely the naive Bayesian classifier and decision trees, to suit temporal prediction. This results in two new techniques: a temporal naive Bayesian (T-NB) model and a temporal decision tree (T-DT). T-NB and T-DT have been tested on seven real-life data sets from an oil refinery. Experimental results show that they produce very accurate predictions.
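The abstract does not detail how the classifiers are "temporalized". One common and simple reading, used purely as an illustrative assumption here, is to classify the current state from lagged (windowed) values of the input variables with an otherwise ordinary naive Bayes model.

```python
# Illustrative sketch only: naive Bayes over lagged (windowed) readings of the
# input variables. Whether T-NB works this way is an assumption.
from collections import defaultdict
import math

def make_windows(series, labels, lag):
    """Turn a multivariate series into (window, label) training pairs."""
    data = []
    for t in range(lag, len(series)):
        window = tuple(v for row in series[t - lag:t] for v in row)
        data.append((window, labels[t]))
    return data

class NaiveBayes:
    def fit(self, data):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        for features, label in data:
            self.class_counts[label] += 1
            for i, v in enumerate(features):
                self.feature_counts[label][(i, v)] += 1
        self.n = len(data)
        return self

    def predict(self, features):
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.n)
            for i, v in enumerate(features):
                # Laplace smoothing over a nominal two-value domain.
                score += math.log((self.feature_counts[label][(i, v)] + 1) / (count + 2))
            if score > best_score:
                best, best_score = label, score
        return best

# Two discretized sensor variables; the label marks abnormal operating states.
series = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 1), (1, 1), (1, 1)]
labels = ["norm", "norm", "norm", "abn", "norm", "norm", "abn"]
model = NaiveBayes().fit(make_windows(series, labels, lag=1))
print(model.predict((1, 1)))   # "abn" on this toy data
```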
{"title":"Multivariate time series prediction via temporal classification","authors":"B. Liu, Jing Liu","doi":"10.1109/ICDE.2002.994722","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994722","url":null,"abstract":"In this paper, we study a special form of time-series prediction, viz. the prediction of a dependent variable taking discrete values. Although in a real application this variable may take numeric values, the users are usually only interested in its value ranges, e.g. normal or abnormal, not its actual values. In this work, we extended two traditional classification techniques, namely the naive Bayesian classifier and decision trees, to suit temporal prediction. This results in two new techniques: a temporal naive Bayesian (T-NB) model and a temporal decision tree (T-DT). T-NB and T-DT have been tested on seven real-life data sets from an oil refinery. Experimental results show that they perform very accurate predictions.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114353069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recovery guarantees for general multi-tier applications
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994773
R. Barga, D. Lomet, G. Weikum
Database recovery does not mask failures to applications and users. Recovery is needed that considers data, messages and application components. Special cases have been studied, but clear principles for recovery guarantees in general multi-tier applications such as Web-based e-services are missing. We develop a framework for recovery guarantees that masks almost all failures. The main concept is an interaction contract between two components, a pledge as to message and state persistence, and contract release. Contracts are composed into system-wide agreements so that a set of components is provably recoverable with exactly-once message delivery and execution, except perhaps for crash-interrupted user input or output. Our implementation techniques reduce the data logging cost, allow effective log truncation, and provide independent recovery for critical server components. Interaction contracts form the basis for our Phoenix/COM project on persistent components. Our framework's utility is demonstrated with a case study of a web-based e-service.
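As a generic illustration of one ingredient behind exactly-once delivery (not the paper's interaction-contract protocol), a sender can log each message under a unique id before sending so it can be re-sent after a crash, while the receiver discards duplicates by id so a resend does not cause re-execution.

```python
# Generic exactly-once-delivery ingredient, for illustration only: logged
# sends plus duplicate suppression by message id. Not the paper's protocol.
import uuid

class Sender:
    def __init__(self):
        self.log = {}                    # persisted before the send in a real system

    def send(self, receiver, payload):
        msg_id = str(uuid.uuid4())
        self.log[msg_id] = payload       # pledge: the message can be re-sent after a crash
        receiver.deliver(msg_id, payload)
        return msg_id

class Receiver:
    def __init__(self):
        self.seen = set()                # persisted as part of the component state

    def deliver(self, msg_id, payload):
        if msg_id in self.seen:
            return                       # duplicate after a resend: ignore
        self.seen.add(msg_id)
        print("processing", payload)

r, s = Receiver(), Sender()
mid = s.send(r, {"op": "debit", "amount": 10})
r.deliver(mid, {"op": "debit", "amount": 10})   # simulated resend is ignored
```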
{"title":"Recovery guarantees for general multi-tier applications","authors":"R. Barga, D. Lomet, G. Weikum","doi":"10.1109/ICDE.2002.994773","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994773","url":null,"abstract":"Database recovery does not mask failures to applications and users. Recovery is needed that considers data, messages and application components. Special cases have been studied, but clear principles for recovery guarantees in general multi-tier applications such as Web-based e-services are missing. We develop a framework for recovery guarantees that masks almost all failures. The main concept is an interaction contract between two components, a pledge as to message and state persistence, and contract release. Contracts are composed into system-wide agreements so that a set of components is provably recoverable with exactly-once message delivery and execution, except perhaps for crash-interrupted user input or output. Our implementation techniques reduce the data logging cost, allow effective log truncation, and provide independent recovery for critical server components. Interaction contracts form the basis for our Phoenix/COM project on persistent components. Our framework's utility is demonstrated with a case study of a web-based e-service.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134490232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attribute classification using feature analysis
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994725
Felix Naumann, C. T. H. Ho, Xuqing Tian, L. Haas, N. Megiddo
The basis of many systems that integrate data from multiple sources is a set of correspondences between source schemata and a target schema. Correspondences express a relationship between sets of source attributes, possibly from multiple sources, and a set of target attributes. Clio is an integration tool that assists users in defining value correspondences between attributes. In real-life scenarios there may be many sources, and the source relations may have many attributes. Users can get lost and may miss or be unable to find some correspondences. Also, in many real-life schemata the attribute names reveal little or nothing about the semantics of the data values; only the data values in the attribute columns convey the semantic meaning of the attribute. Our work relieves users of the problems of too many attributes and meaningless attribute names by automatically suggesting correspondences between source and target attributes. For each attribute, we analyze the data values and derive a set of features.
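A minimal sketch of value-based feature analysis: the concrete features and the nearest-match rule below are invented for illustration; the paper derives a richer feature set and uses it to classify and suggest correspondences.

```python
# Illustration only: derive simple features from an attribute's data values and
# match each source attribute to the most similar target attribute.
import math

def features(values):
    strs = [str(v) for v in values]
    return {
        "avg_len": sum(len(s) for s in strs) / len(strs),
        "frac_numeric": sum(s.replace(".", "", 1).isdigit() for s in strs) / len(strs),
        "frac_with_at": sum("@" in s for s in strs) / len(strs),
    }

def dist(f, g):
    return math.sqrt(sum((f[k] - g[k]) ** 2 for k in f))

def suggest(source_columns, target_columns):
    """Suggest, for each source attribute, the closest target attribute."""
    target_feats = {name: features(vals) for name, vals in target_columns.items()}
    suggestions = {}
    for name, vals in source_columns.items():
        f = features(vals)
        suggestions[name] = min(target_feats, key=lambda t: dist(f, target_feats[t]))
    return suggestions

source = {"col1": ["jo@x.com", "ann@y.org"], "col2": ["23.5", "17.0", "99.9"]}
target = {"email": ["bob@z.net", "eve@w.io"], "price": ["10.0", "250.75"]}
print(suggest(source, target))   # {'col1': 'email', 'col2': 'price'}
```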
{"title":"Attribute classification using feature analysis","authors":"Felix Naumann, C. T. H. Ho, Xuqing Tian, L. Haas, N. Megiddo","doi":"10.1109/ICDE.2002.994725","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994725","url":null,"abstract":"The basis of many systems that integrate data from multiple sources is a set of correspondences between source schemata and a target schema. Correspondences express a relationship between sets of source attributes, possibly from multiple sources, and a set of target attributes. Clio is an integration tool that assists users in defining value correspondences between attributes. In real life scenarios there may be many sources and the source relations may have many attributes. Users can get lost and might miss or be unable to find some correspondences. Also, in many real life schemata the attribute names reveal little or nothing about the semantics of the data values. Only the data values in the attribute columns can convey the semantic meaning of the attribute. Our work relieves users of the problems of too many attributes and meaningless attribute names, by automatically suggesting correspondences between source and target attributes. For each attribute, we analyze the data values and derive a set of features.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"278 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131662327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An intuitive framework for understanding changes in evolving data streams
Pub Date: 2002-08-07, DOI: 10.1109/ICDE.2002.994715
C. Aggarwal
Many organizations today store large streams of transactional data in real time. This data can often show important changes in trends over time. In many commercial applications, it may be valuable to provide the user with an understanding of the nature of changes occurring over time in the data stream. In this paper, we discuss the process of analysing the significant changes and trends in data streams in a way which is understandable, intuitive and user-friendly.
{"title":"An intuitive framework for understanding changes in evolving data streams","authors":"C. Aggarwal","doi":"10.1109/ICDE.2002.994715","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994715","url":null,"abstract":"Many organizations today store large streams of transactional data in real time. This data can often show important changes in trends over time. In many commercial applications, it may be valuable to provide the user with an understanding of the nature of changes occuring over time in the data stream. In this paper, we discuss the process of analysing the significant changes and trends in data streams in a way which is understandable, intuitive and user-friendly.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115399766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}