Mapping XML and relational schemas with Clio
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994768
Mauricio A. Hernández, Lucian Popa, Yannis Velegrakis, Renée J. Miller, Felix Naumann, C. T. H. Ho
Merging and coalescing data from multiple, diverse sources into different data formats remains an important problem in modern information systems. Schema matching (matching elements of a source schema with elements of a target schema) and schema mapping (creating a query that maps between two disparate schemas) are at the heart of data integration systems. We demonstrate Clio, a semi-automatic schema mapping tool developed at the IBM Almaden Research Center. In this paper, we showcase Clio's mapping engine, which maps to and from relational and XML schemas and exploits data constraints to preserve data associations.
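The transformation such a mapping engine produces can be pictured with a toy example. The sketch below is not Clio's generated query; it hand-codes one plausible relational-to-nested mapping in Python, using a foreign key (a data constraint) to keep departments associated with their courses. All table and field names are invented for illustration.

```python
# Two flat source tables, linked by the foreign key courses.did -> depts.did.
depts = [{"did": 1, "dname": "CS"}, {"did": 2, "dname": "EE"}]
courses = [
    {"cid": 10, "title": "DB", "did": 1},
    {"cid": 11, "title": "OS", "did": 1},
    {"cid": 20, "title": "VLSI", "did": 2},
]

# A nested (XML-like) target: each department element contains its courses.
# Following the foreign key is what preserves the data association.
target = [
    {
        "department": d["dname"],
        "courses": [c["title"] for c in courses if c["did"] == d["did"]],
    }
    for d in depts
]
print(target)
# [{'department': 'CS', 'courses': ['DB', 'OS']},
#  {'department': 'EE', 'courses': ['VLSI']}]
```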
{"title":"Mapping XML and relational schemas with Clio","authors":"Mauricio A. Hernández, Lucian Popa, Yannis Velegrakis, Renée J. Miller, Felix Naumann, C. T. H. Ho","doi":"10.1109/ICDE.2002.994768","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994768","url":null,"abstract":"Merging and coalescing data from multiple and diverse sources into different data formats continues to be an important problem in modern information systems. Schema matching (the process of matching elements of a source schema with elements of a target schema) and schema mapping (the process of creating a query that maps between two disparate schemas) are at the heart of data integration systems. We demonstrate Clio, a semi-automatic schema mapping tool developed at the IBM Almaden Research Center. In this paper, we showcase Clio's mapping engine which allows mapping to and from relational and XML schemas, and takes advantage of data constraints in order to preserve data associations.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134156352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GADT: a probability space ADT for representing and querying the physical world
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994710
Anton Faradjian, J. Gehrke, Philippe Bonnet
Large sensor networks are being widely deployed for measurement, detection and monitoring applications. Many of these applications involve database systems to store and process data from the physical world. This data has inherent measurement uncertainties that are properly represented by continuous probability distribution functions (PDFs). We introduce a new object-relational abstract data type (ADT) - the Gaussian ADT (GADT) - that models physical data as Gaussian PDFs, and we show that existing index structures can be used as fast access methods for GADT data. We also present a measurement-theoretic model of probabilistic data and evaluate GADT in its light.
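The abstract does not spell out GADT's query interface, so the following is only a minimal sketch of the idea: a 1-D Gaussian value type plus a hypothetical probability-threshold selection over a set of sensor readings. Names and thresholds are invented for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Gaussian:
    """A 1-D Gaussian measurement: mean and standard deviation."""
    mu: float
    sigma: float

    def cdf(self, x: float) -> float:
        # Gaussian CDF via the error function.
        return 0.5 * (1.0 + math.erf((x - self.mu) / (self.sigma * math.sqrt(2.0))))

    def prob_within(self, lo: float, hi: float) -> float:
        """Probability that the true value lies in [lo, hi]."""
        return self.cdf(hi) - self.cdf(lo)

# Hypothetical probability-threshold query: which sensors report a
# temperature in [20, 25] with probability at least 0.9?
readings = {"s1": Gaussian(22.0, 0.5), "s2": Gaussian(27.0, 2.0)}
hits = [sid for sid, g in readings.items() if g.prob_within(20.0, 25.0) >= 0.9]
print(hits)  # ['s1']
```

Because each value reduces to a small vector of parameters (here mu and sigma), a conventional multidimensional index over those parameters could plausibly serve as the fast access method the abstract mentions.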
{"title":"GADT: a probability space ADT for representing and querying the physical world","authors":"Anton Faradjian, J. Gehrke, Philippe Bonnet","doi":"10.1109/ICDE.2002.994710","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994710","url":null,"abstract":"Large sensor networks are being widely deployed for measurement, detection and monitoring applications. Many of these applications involve database systems to store and process data from the physical world. This data has inherent measurement uncertainties that are properly represented by continuous probability distribution functions (PDFs). We introduce a new object-relational abstract data type (ADT) - the Gaussian ADT (GADT) - that models physical data as Gaussian PDFs, and we show that existing index structures can be used as fast access methods for GADT data. We also present a measurement-theoretic model of probabilistic data and evaluate GADT in its light.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125772940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discovering similar multidimensional trajectories
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994784
M. Vlachos, D. Gunopulos, G. Kollios
We investigate techniques for analysis and retrieval of object trajectories in two- or three-dimensional space. Such data usually contain a large amount of noise, which has made previously used metrics fail. We therefore formalize non-metric similarity functions based on the longest common subsequence (LCSS), which are very robust to noise and, furthermore, provide an intuitive notion of similarity between trajectories by giving more weight to similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translation of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and time-warping distance functions (on real and synthetic data) and show the superiority of our approach, especially in the presence of strong noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest-neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach.
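A minimal dynamic-programming sketch of the basic LCSS similarity: two points match when they lie within eps in every coordinate and their time indices differ by at most delta (the time-stretching window). The translation-invariant variant and the efficient approximate algorithms from the paper are not reproduced here; the eps and delta values are illustrative.

```python
def lcss(A, B, eps=0.2, delta=5):
    """Length of the longest common subsequence of trajectories A and B,
    where points match if within eps per coordinate and at most delta
    apart in time."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close = all(abs(a - b) < eps for a, b in zip(A[i - 1], B[j - 1]))
            if close and abs(i - j) <= delta:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

def similarity(A, B, **kw):
    # Normalize to [0, 1]: 1 means one trajectory nearly contains the other.
    return lcss(A, B, **kw) / min(len(A), len(B))

A = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.1)]
B = [(0.1, 0.05), (1.05, 0.95), (2.0, 2.0)]
print(similarity(A, B))  # 1.0: every point of A matches a point of B
```

Because a single outlier can lower the count by at most one, the score degrades gracefully under noise, unlike Euclidean distance, where one bad sample can dominate the sum.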
{"title":"Discovering similar multidimensional trajectories","authors":"M. Vlachos, D. Gunopulos, G. Kollios","doi":"10.1109/ICDE.2002.994784","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994784","url":null,"abstract":"We investigate techniques for analysis and retrieval of object trajectories in two or three dimensional space. Such data usually contain a large amount of noise, that has made previously used metrics fail. Therefore, we formalize non-metric similarity functions based on the longest common subsequence (LCSS), which are very robust to noise and furthermore provide an intuitive notion of similarity between trajectories by giving more weight to similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translation of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and time warping distance functions (for real and synthetic data) and show the superiority of our approach, especially in the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123128557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extensible and similarity-based grouping for data integration
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994731
E. Schallehn, K. Sattler, G. Saake
The general concept of grouping and aggregation appears to be a fitting paradigm for various issues in data integration, but in its common form of equality-based grouping, a number of problems remain unsolved. We propose a generic approach to user-defined grouping as part of a SQL extension, allowing for more complex grouping functions, for instance the integration of data-mining algorithms. Furthermore, we discuss high-level language primitives for common applications.
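The abstract does not quote the proposed SQL syntax, so the sketch below illustrates the underlying idea in Python instead: grouping duplicate-like values by the transitive closure of a pairwise similarity predicate rather than by strict equality. The string measure and threshold are illustrative choices, not the paper's.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Any user-defined predicate works here; an edit-distance-style ratio is shown.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def similarity_groups(values):
    """Group values by the transitive closure of pairwise similarity:
    a value joins (and merges) every existing group it is similar to."""
    groups = []
    for v in values:
        matching = [g for g in groups if any(similar(v, w) for w in g)]
        for g in matching:
            groups.remove(g)
        groups.append(sum(matching, []) + [v])
    return groups

names = ["John Smith", "Jon Smith", "J. Smith", "Alice Jones"]
print(similarity_groups(names))
# [['John Smith', 'Jon Smith', 'J. Smith'], ['Alice Jones']]
```

Unlike equality-based GROUP BY, the result depends on the chosen predicate and, since similarity is not transitive, on the closure rule, which is exactly where user-defined grouping semantics come in.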
{"title":"Extensible and similarity-based grouping for data integration","authors":"E. Schallehn, K. Sattler, G. Saake","doi":"10.1109/ICDE.2002.994731","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994731","url":null,"abstract":"The general concept of grouping and aggregation appears to be a fitting paradigm for various issues in data integration, but in its common form of equality-based grouping, a number of problems remain unsolved. We propose a generic approach to user-defined grouping as part of a SQL extension, allowing for more complex functions, for instance integration of data mining algorithms. Furthermore, we discuss high-level language primitives for common applications.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121765841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NeT and CoT: inferring XML schemas from relational world
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994721
Dongwon Lee, Murali Mani, Frank Chiu, W. Chu
We present two conversion algorithms, NeT and CoT, that translate relational schemas to XML schemas using various semantic constraints. We first present a language-independent formalism named XSchema, so that our algorithms can generate output schemas in any of the various XML schema language proposals. The benefit of such a formalism is that it is both precise and concise. Based on the XSchema formalism, our algorithms have the following characteristics: (1) NeT derives a nested structure from a flat relational model by repeatedly applying the nest operator, so that the resulting XML schema becomes hierarchical; and (2) CoT considers not only the structure of relational schemas but also inclusion dependencies during the translation, so that relational schemas in which multiple tables are interconnected through inclusion dependencies can also be handled.
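The nest operator that NeT applies repeatedly can be sketched in isolation (an illustration, not the paper's code): tuples that agree on all attributes except one are collapsed, and that attribute's values are collected into a set, suggesting a repeated child element in the target XML schema.

```python
from collections import defaultdict

def nest(rows, attr):
    """Relational nest operator: collapse tuples that agree on every
    attribute except `attr`, gathering the `attr` values into a set."""
    grouped = defaultdict(set)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != attr))
        grouped[key].add(row[attr])
    return [dict(key, **{attr: vals}) for key, vals in grouped.items()]

rows = [
    {"dept": "CS", "course": "DB"},
    {"dept": "CS", "course": "OS"},
    {"dept": "EE", "course": "VLSI"},
]
print(nest(rows, "course"))
# [{'dept': 'CS', 'course': {'DB', 'OS'}}, {'dept': 'EE', 'course': {'VLSI'}}]
# (set order may vary)
```

Repeated application over different attributes yields deeper nesting; choosing which attributes to nest, and in what order, is the substance of the algorithm.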
{"title":"NeT and CoT: inferring XML schemas from relational world","authors":"Dongwon Lee, Murali Mani, Frank Chiu, W. Chu","doi":"10.1109/ICDE.2002.994721","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994721","url":null,"abstract":"Two conversion algorithms, called NeT and COT, to translate relational schemas to XML schemas using various semantic constraints are presented. We first present a language-independent formalism named XSchema so that our algorithms are able to generate output schema in various XML schema language proposals. The benefits of such a formalism are that it is both precise and concise. Based on the XSchema formalism, our proposed algorithms have the following characteristics: (1) NeT derives a nested structure from a flat relational model by repeatedly applying the nest operator so that the resulting XML schema becomes hierarchical, and (2) COT considers not only the structure of relational schemas, but also inclusion dependencies during the translation so that relational schemas where multiple tables are interconnected through inclusion dependencies can also be handled.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121794865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient algorithm for projected clustering
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994727
Eric Ka Ka Ng, A. Fu
With high-dimensional data, natural clusters are expected to exist in different subspaces. We propose the EPC (efficient projected clustering) algorithm to discover the sets of correlated dimensions and the locations of the clusters. This algorithm differs substantially from previous approaches and has the following advantages: (1) it requires no input regarding the number of natural clusters or the average cardinality of the subspaces; (2) it can handle clusters of irregular shapes; (3) it produces better clustering results than the best previous method; (4) it has high scalability. In experiments, it is several times faster than the previous method while producing more accurate results.
{"title":"Efficient algorithm for projected clustering","authors":"Eric Ka Ka Ng, A. Fu","doi":"10.1109/ICDE.2002.994727","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994727","url":null,"abstract":"With high-dimensional data, natural clusters are expected to exist in different subspaces. We propose the EPC (efficient projected clustering) algorithm to discover the sets of correlated dimensions and the location of the clusters. This algorithm is quite different from previous approaches and has the following advantages: (1) there is no requirement on the input regarding the number of natural clusters and the average cardinality of the subspaces; (2) it can handle clusters of irregular shapes; (3) it produces better clustering results compared to the best previous method; (4) it has high scalability. From experiments, it is several times faster than the previous method, while producing more accurate results.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127501190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A distributed database server for continuous media
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994764
Walid G. Aref, A. Catlin, A. Elmagarmid, Jianping Fan, J. Guo, M. Hammad, I. Ilyas, M. Marzouk, Sunil Prabhakar, A. Rezgui, S. Teoh, Evimaria Terzi, Yi-Cheng Tu, A. Vakali, Xingquan Zhu
In our project, we adopt a new approach to handling video data: we view video as a well-defined data type with its own description, parameters and applicable methods. The system is based on PREDATOR, an open-source object-relational DBMS, which uses Shore as the underlying storage manager. Supporting video operations (storing, searching by content and streaming) and new query types (query-by-example and multi-feature similarity search) requires major changes in many of the traditional system components. More specifically, the storage and buffer manager has to deal with huge volumes of data under real-time constraints, and query processing has to consider the video methods and operators when generating, optimizing and executing query plans.
{"title":"A distributed database server for continuous media","authors":"Walid G. Aref, A. Catlin, A. Elmagarmid, Jianping Fan, J. Guo, M. Hammad, I. Ilyas, M. Marzouk, Sunil Prabhakar, A. Rezgui, S. Teoh, Evimaria Terzi, Yi-Cheng Tu, A. Vakali, Xingquan Zhu","doi":"10.1109/ICDE.2002.994764","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994764","url":null,"abstract":"In our project, we are adopting a new approach for handling video data. We view the video as a well-defined data type with its own description, parameters and applicable methods. The system is based on PREDATOR, an open-source object-relational DBMS. PREDATOR uses Shore as the underlying storage manager. Supporting video operations (storing, searching-by-content and streaming) and new query types (query-by-example and multi-feature similarity searching) requires major changes in many of the traditional system components. More specifically, the storage and buffer manager has to deal with huge volumes of data with real-time constraints. Query processing has to consider the video methods and operators in generating, optimizing and executing the query plans.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128263753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keyword searching and browsing in databases using BANKS
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994756
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan
With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
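A toy sketch of the graph model just described, with a proximity-only score (BANKS additionally weights node prestige based on inlinks, and uses a heuristic to enumerate answer trees; neither is reproduced here). The tuples, edges and query are invented for illustration.

```python
from collections import deque

# Tuples as nodes; edges induced by foreign-key relationships.
edges = {
    "paper:1": ["author:1", "author:2", "conf:ICDE"],
    "author:1": ["paper:1"],
    "author:2": ["paper:1"],
    "conf:ICDE": ["paper:1"],
}

def bfs_dist(src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in edges.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def best_root(keyword_nodes):
    """Root of the answer tree: the node minimizing total distance to all
    keyword-matching tuples (a crude proximity measure)."""
    dists = [bfs_dist(n) for n in keyword_nodes]
    candidates = set.intersection(*(set(d) for d in dists))
    return min(candidates, key=lambda n: sum(d[n] for d in dists))

# The query keywords match the two author tuples and the conference tuple;
# the paper tuple connects all three, so it roots the answer tree.
print(best_root(["author:1", "author:2", "conf:ICDE"]))  # paper:1
```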
{"title":"Keyword searching and browsing in databases using BANKS","authors":"Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan","doi":"10.1109/ICDE.2002.994756","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994756","url":null,"abstract":"With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132588105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient OLAP query processing in distributed data warehouses
Pub Date : 2002-02-26 | DOI: 10.1109/ICDE.2002.994716
M. Akinde, Michael H. Böhlen, T. Johnson, L. Lakshmanan, D. Srivastava
The success of Internet applications has led to explosive growth in the demand for bandwidth from ISPs. Managing an IP network involves complex data analysis that can often be expressed as OLAP queries. Current-day OLAP tools assume the availability of the detailed data in a centralized warehouse. However, the inherently distributed nature of the data collection (e.g., flow-level traffic statistics are gathered at network routers) and the huge amount of data extracted at each collection point (on the order of several gigabytes per day for large IP networks) make such an approach highly impractical. The natural solution is to maintain a distributed data warehouse, consisting of multiple local data warehouses (sites) adjacent to the collection points, together with a coordinator. For such a solution to make sense, we need a technology for distributed processing of complex OLAP queries. We have developed the Skalla system for this task, and we conducted an experimental study of the Skalla evaluation scheme using TPC-R data.
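Skalla's plan operators are not described in the abstract; the sketch below shows only the generic pattern such a system exploits, namely pre-aggregating at each site and merging partial aggregates at the coordinator, so that raw detail rows never cross the network. Data and names are invented.

```python
from collections import Counter

# Per-site detail data: (router, bytes) rows collected locally.
site1 = [("r1", 100), ("r2", 50), ("r1", 25)]
site2 = [("r1", 10), ("r3", 70)]

def local_aggregate(rows):
    """Each site pre-aggregates its own detail data: SUM(bytes) per router."""
    acc = Counter()
    for router, nbytes in rows:
        acc[router] += nbytes
    return acc

def coordinator_merge(partials):
    """The coordinator combines partial aggregates; only these small
    summaries, never the raw rows, are shipped between sites."""
    total = Counter()
    for p in partials:
        total.update(p)  # Counter.update adds counts
    return dict(total)

print(coordinator_merge([local_aggregate(site1), local_aggregate(site2)]))
# {'r1': 135, 'r2': 50, 'r3': 70}
```

This decomposition works because SUM (like COUNT, MIN and MAX) is distributive; the harder part, which a system like Skalla must address, is doing the same for whole OLAP query plans with grouping spread across sites.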
{"title":"Efficient OLAP query processing in distributed data warehouses","authors":"M. Akinde, Michael H. Böhlen, T. Johnson, L. Lakshmanan, D. Srivastava","doi":"10.1109/ICDE.2002.994716","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994716","url":null,"abstract":"The success of Internet applications has led to an explosive growth in the demand for bandwidth from ISPs. Managing an IP network includes complex data analysis that can often be expressed as OLAP queries. Current day OLAP tools assume the availability of the detailed data in a centralized warehouse. However, the inherently distributed nature of the data collection (e.g., flow-level traffic statistics are gathered at network routers) and the huge amount of data extracted at each collection point (of the order of several gigabytes per day for large IP networks) makes such an approach highly impractical. The natural solution to this problem is to maintain a distributed data warehouse, consisting of multiple local data warehouses (sites) adjacent to the collection points, together with a coordinator. In order for such a solution to make sense, we need a technology for distributed processing of complex OLAP queries. We have developed the Skalla system for this task. We conducted an experimental study of the Skalla evaluation scheme using TPC(R) data.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"19 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114043176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently ordering query plans for data integration
Pub Date : 1999-07-31 | DOI: 10.1109/ICDE.2002.994753
A. Doan, A. Halevy
The goal of a data integration system is to provide a uniform interface to a multitude of data sources. Given a user query formulated in this interface, the system translates it into a set of query plans. Each plan is a query formulated over the data sources, and specifies a way to access the sources and combine their data to answer the user query. In practice, when the number of sources is large, a data integration system must generate and execute many query plans with significantly varying utilities. Hence, it is crucial that the system find the best plans efficiently and execute them first, to guarantee acceptable time to, and quality of, the first answers. We describe efficient solutions to this problem. First, we formally define the problem of ordering query plans. Second, we identify several interesting structural properties of the problem and describe three ordering algorithms that exploit these properties. Finally, we describe experimental results that offer guidance on which algorithms perform best under which conditions.
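The three ordering algorithms are not named in the abstract, so the sketch below illustrates only the problem setting with a naive greedy baseline: given per-plan utility estimates (invented numbers here), execute plans in descending estimated utility so that high-value answers arrive first.

```python
# Hypothetical plans with estimated utilities: coverage is the fraction of
# answers a plan is expected to return, overlap the fraction expected to be
# duplicated by other plans.
plans = [
    {"id": "p1", "est_coverage": 0.2, "est_overlap": 0.05},
    {"id": "p2", "est_coverage": 0.7, "est_overlap": 0.10},
    {"id": "p3", "est_coverage": 0.4, "est_overlap": 0.02},
]

def estimated_utility(plan):
    # One plausible utility: expected fraction of *new* answers contributed.
    return plan["est_coverage"] - plan["est_overlap"]

execution_order = sorted(plans, key=estimated_utility, reverse=True)
print([p["id"] for p in execution_order])  # ['p2', 'p3', 'p1']
```

Improving on this baseline, by exploiting structural properties of the problem so that good plans surface without scoring and sorting every candidate up front, is the kind of gain the paper's algorithms target.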
{"title":"Efficiently ordering query plans for data integration","authors":"A. Doan, A. Halevy","doi":"10.1109/ICDE.2002.994753","DOIUrl":"https://doi.org/10.1109/ICDE.2002.994753","url":null,"abstract":"The goal of a data integration system is to provide a uniform interface to a multitude of data sources. Given a user query formulated in this interface, the system translates it into a set of query plans. Each plan is a query formulated over the data sources, and specifies a way to access sources and combine data to answer the user query. In practice, when the number of sources is large, a data-integration system must generate and execute many query plans with significantly varying utilities. Hence, it is crucial that the system finds the best plans efficiently and executes them first, to guarantee acceptable time to and the quality of the first answers. We describe efficient solutions to this problem. First, we formally define the problem of ordering query plans. Second, we identify several interesting structural properties of the problem and describe three ordering algorithms that exploit these properties. Finally, we describe experimental results that suggest guidance on which algorithms perform best under which conditions.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114894775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}