Widespread interest in time-series similarity search has increased the need for efficient techniques that reduce the dimensionality of the data so that it can be indexed easily with a multidimensional structure. In this paper, we introduce a new technique, called grid representation, based on a grid approximation of the data. We propose a lower-bounding distance measure that enables a bitmap approach for fast computation and searching. We also show how the grid representation can be indexed with a multidimensional index structure, and demonstrate its superiority.
{"title":"Grid Representation for Efficient Similarity Search in Time Series Databases","authors":"Guifang Duan, Yu Suzuki, K. Kawagoe","doi":"10.1109/ICDEW.2006.63","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.63","url":null,"abstract":"Widespread interest in time-series similarity search has made more in need of efficient technique, which can reduce dimensionality of the data and then to index it easily using a multidimensional structure. In this paper, we introduce a new technique, which we called grid representation, based on a grid approximation of the data. We propose a lower bounding distance measure that enables a bitmap approach for fast computation and searching. We also show how grid representation can be indexed with a multidimensional index structure, and demonstrate its superiority.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"56 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128003232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A key aspect of any data integration endeavor is establishing a transformation that translates instances of one or more source schemata into instances of a target schema. This schema integration task must be tackled regardless of the integration architecture or mapping formalism. In this paper we provide a task model for schema integration. We use this breakdown to motivate a workbench for schema integration in which multiple tools share a common knowledge repository. In particular, the workbench facilitates the interoperation of research prototypes for schema matching (which automatically identify likely semantic correspondences) with commercial schema mapping tools (which help produce instance-level transformations). Currently, each of these tools provides its own ad hoc representation of schemata and mappings; combining these tools requires aligning these representations. The workbench provides a common representation so that these tools can more rapidly be combined.
{"title":"Integration Workbench: Integrating Schema Integration Tools","authors":"P. Mork, A. Rosenthal, Leonard J. Seligman, Joel Korb, Ken Samuel","doi":"10.1109/ICDEW.2006.69","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.69","url":null,"abstract":"A key aspect of any data integration endeavor is establishing a transformation that translates instances of one or more source schemata into instances of a target schema. This schema integration task must be tackled regardless of the integration architecture or mapping formalism. In this paper we provide a task model for schema integration. We use this breakdown to motivate a workbench for schema integration in which multiple tools share a common knowledge repository. In particular, the workbench facilitates the interoperation of research prototypes for schema matching (which automatically identify likely semantic correspondences) with commercial schema mapping tools (which help produce instance-level transformations). Currently, each of these tools provides its own ad hoc representation of schemata and mappings; combining these tools requires aligning these representations. The workbench provides a common representation so that these tools can more rapidly be combined.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129312795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data mining is the process of discovering hidden and meaningful knowledge in a data set. It has been successfully applied to many real-life problems, for instance, web personalization, network intrusion detection, and customized marketing. Recent advances in computational sciences have led to the application of data mining to various scientific domains, such as astronomy and bioinformatics, to facilitate the understanding of different scientific processes in the underlying domain. In this thesis work, we focus on designing and applying data mining techniques to analyze spatial and spatio-temporal data originating in scientific domains. Examples of spatial and spatio-temporal data in scientific domains include data describing protein structures and data produced by protein folding simulations, respectively. Specifically, we have proposed a generalized framework to effectively discover different types of spatial and spatio-temporal patterns in scientific data sets. Such patterns can be used to capture a variety of interactions among objects of interest and the evolutionary behavior of such interactions. We have applied the framework to analyze data originating in the following three application domains: bioinformatics, computational molecular dynamics, and computational fluid dynamics. Empirical results demonstrate that the discovered patterns are meaningful in the underlying domain and can provide important insights into various scientific phenomena.
{"title":"Mining Spatial and Spatio-Temporal Patterns in Scientific Data","authors":"Hui Yang, S. Parthasarathy","doi":"10.1109/ICDEW.2006.92","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.92","url":null,"abstract":"Data mining is the process of discovering hidden and meaningful knowledge in a data set. It has been successfully applied to many real-life problems, for instance, web personalization, network intrusion detection, and customized marketing. Recent advances in computational sciences have led to the application of data mining to various scientific domains, such as astronomy and bioinformatics, to facilitate the understanding of different scientific processes in the underlying domain. In this thesis work, we focus on designing and applying data mining techniques to analyze spatial and spatiotemporal data originated in scientific domains. Examples of spatial and spatio-temporal data in scientific domains include data describing protein structures and data produced from protein folding simulations, respectively. Specifically, we have proposed a generalized framework to effectively discover different types of spatial and spatio-temporal patterns in scientific data sets. Such patterns can be used to capture a variety of interactions among objects of interest and the evolutionary behavior of such interactions. We have applied the framework to analyze data originated in the following three application domains: bioinformatics, computational molecular dynamics, and computational fluid dynamics. Empirical results demonstrate that the discovered patterns are meaningful in the underlying domain and can provide important insights into various scientific phenomena.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128624882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When wrapping web interfaces, ontological knowledge is important to support an automated interpretation of information. Developing ontologies, however, is time-consuming and not realistic in global contexts. On the other hand, the web itself provides a huge amount of knowledge that can be used instead of ontologies. Three common classes of web knowledge sources are Web thesauri, search engines, and Web encyclopedias. This paper investigates how Web knowledge can be utilized to solve three semantic problems: Parameter Finding for Query Interfaces, Labeling of Values, and Relabeling after Interface Evolution. To solve the parameter-finding problem, an algorithm has been implemented that uses the web encyclopedia Wikipedia for the initial identification of parameter value candidates and the search engine Google to validate label-value relationships. The approach has been integrated into a wrapper definition framework.
{"title":"UsingWeb Knowledge to Improve the Wrapping of Web Sources","authors":"Thomas Kabisch, Ronald Padur, D. Rother","doi":"10.1109/ICDEW.2006.160","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.160","url":null,"abstract":"During the wrapping of web interfaces ontological know-ledge is important in order to support an automated interpretation of information. The development of ontologies is a time consuming issue and not realistic in global contexts. On the other hand, the web provides a huge amount of knowledge, which can be used instead of ontologies. Three common classes of web knowledge sources are: Web Thesauri, search engines and Web encyclopedias. The paper investigates how Web knowledge can be utilized to solve the three semantic problems Parameter Finding for Query Interfaces, Labeling of Values and Relabeling after interface evolution. For the solution of the parameter finding problem an algorithm has been implemented using the web encyclopedia WikiPedia for the initial identification of parameter value candidates and the search engine Google for a validation of label-value relationships. The approach has been integrated into a wrapper definition framework.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133753296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we describe NKRL (Narrative Knowledge Representation Language), a conceptual modeling formalism for capturing the semantic characteristics of an important component of eChronicle information: the ‘narrative’ documents. In these documents, the main part of the information consists of descriptions of the ‘events’ that relate the real or intended behavior of some ‘actors’. Narrative documents of industrial and economic interest include news stories, corporate documents, normative and legal texts, intelligence messages, medical records, etc. NKRL employs several representational principles and a number of high-level inference tools.
{"title":"Modeling and Advanced Exploitation of eChronicle ‘Narrative’ Information","authors":"G. P. Zarri","doi":"10.1109/ICDEW.2006.95","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.95","url":null,"abstract":"In this paper, we describe NKRL (Narrative Knowledge Representation Language), a conceptual modeling formalism for taking into account the semantic characteristics of this important component of eChronicle information represented by the ‘narrative’ documents. In these documents, the main part of the information consists in the description of the ‘events’ that relate the real or intended behavior of some ‘actors’. Narrative documents of an industrial and economic interest correspond to news stories, corporate documents, normative and legal texts, intelligence messages, medical records, etc. NKRL employs several representational principles and some high-level inference tools.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130188007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As various models and languages have been proposed to handle information in the Semantic Web, it is important to be able to translate data from one to another. Referring to two specific models, namely RDF and Topic Maps, we propose a meta-modelling approach based on previous experience in handling heterogeneity in the database world.
{"title":"Management of Heterogeneity in the SemanticWeb","authors":"P. Atzeni, P. D. Nostro","doi":"10.1109/ICDEW.2006.74","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.74","url":null,"abstract":"As various models and languages have been proposed to handle information in the Semantic Web, it is important to be able to translate data from one to another. By referring to two specific models, namely RDF and Topic Maps, we propose a meta-modelling approach, based on previous experiences on handling heterogeneity in the database world.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127892614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large data integration projects must often cope with undocumented data sources. Schema discovery aims at automatically finding structures in such cases. An important class of relationships between attributes that can be detected automatically is inclusion dependencies (INDs), which provide an excellent basis for guessing foreign key constraints. INDs can be discovered by comparing the sets of distinct values of pairs of attributes. In this paper we present efficient algorithms for finding unary INDs. We first show that (and why) SQL is not suitable for this task. We then develop two algorithms that compute inclusion dependencies outside of the database. Both are much faster than the SQL-based methods; in fact, for larger schemas they are the only feasible solution. Our experiments show that we can compute all unary INDs in a schema of 1,680 attributes, with a total database size of 3.2 GB, in approximately 2.5 hours.
{"title":"Efficiently Computing Inclusion Dependencies for Schema Discovery","authors":"Jana Bauckmann, U. Leser, Felix Naumann","doi":"10.1109/ICDEW.2006.54","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.54","url":null,"abstract":"Large data integration projects must often cope with undocumented data sources. Schema discovery aims at automatically finding structures in such cases. An important class of relationships between attributes that can be detected automatically are inclusion dependencies (IND), which provide an excellent basis for guessing foreign key constraints. INDs can be discovered by comparing the sets of distinct values of pairs of attributes. In this paper we present efficient algorithms for finding unary INDs. We first show that (and why) SQL is not suitable for this task. We then develop two algorithms that compute inclusion dependencies outside of the database. Both are much faster than the SQL-based methods; in fact, for larger schemas they are the only feasible solution. Our experiments show that we can compute all unary INDs in a schema of 1, 680 attributes with a total database size of 3.2 GB in approximately 2.5 hours.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129814442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Web pages are collected and stored in Web archives, and several methods to construct Web archives have been developed. We propose a method to retrieve time series of Web pages from Web archives by using the pages’ temporal characteristics. We present two processes for searching Web archives based on the temporal relation of query keywords: one is a method for determining the relation, and the other is a method for querying Web pages based on the relation. In this paper, we discuss the two processes and present experimental results for the method.
{"title":"A Temporal Clustering Method forWeb Archives","authors":"T. Kage, K. Sumiya","doi":"10.1109/ICDEW.2006.23","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.23","url":null,"abstract":"Web pages are collected and stored in Web archives, and several methods to construct Web archives have been developed. We propose a method to retrieve time series of Web pages from Web archives by using the pages’ temporal characteristics. We present two processes for searching Web archives based on the temporal relation of query keywords. One is a method for determining the relation. The other is a method of inquiring Web pages based on the relation. In this paper, we discuss the two processes and an experimental result of the method.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"600 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132226434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The adoption of XML to represent all kinds of data and documents, even complex and huge ones, has become a matter of fact. However, interfacing algorithms and applications with XML parsers requires adapting them: event-based SAX parsers need algorithms that react to events generated by the parser. Moreover, parsing and loading XML documents is slow compared to reading flat files; several research efforts therefore address this problem by improving the parsing phase, e.g., by adopting condensed or binary representations of XML documents. This paper deals with the other side of the coin, i.e., the problem of coupling algorithms with XML parsers in a way that does not require changing the active (polling-based) nature of many algorithms and still provides acceptable performance during execution; this problem becomes even more important for Java algorithms, which are usually less efficient than C or C++ algorithms. This paper presents a study of the problem of loosely coupling Java algorithms with XML parsers. The coupling is loose because the algorithm should be unaware of the particular interface provided by the parser. We consider several coupling techniques and compare them by analyzing their performance. The evaluation leads us to identify the coupling techniques that perform better, depending on the specific algorithm’s needs and the application scenario.
{"title":"Loosely Coupling Java Algorithms and XML Parsers: a Performance-Oriented Study","authors":"G. Psaila","doi":"10.1109/ICDEW.2006.73","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.73","url":null,"abstract":"The adoption of XML to represent any kind of data and documents, even complex and huge, is becoming a matter of fact. However, interfacing algorithms and applications with XML Parsers requires to adapt algorithms and applications: event-based SAX Parsers need algorithms that react to events generated by the parser. But parsing/loading XML documents provides poor performance (if compared to reading flat files): therefore, several researches are trying to address this problem by improving the parsing phase, e.g., by adopting condensed or binary representations of XML documents. This paper deals with the other side of the coin, i.e., the problem of coupling algorithms with XML Parsers, in a way that does not require to change the active (polling-based) nature of many algorithms and provides acceptable performance during execution; this problem becomes even more important when we consider Java algorithms, that usually are less efficient than C or C++ algorithms. This paper presents a study about the problem of loosely coupling Java algorithms with XML Parsers. The coupling is loose because the algorithm should be unaware of the particular interface provided by parsers. We consider several coupling techniques, and we compare them by analyzing their performance. The evaluation leads us to identify the coupling techniques that perform better, depending on the specific algorithm’s needs and application scenario.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127498965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The CORIE forecast factory consists of a set of data product generation runs that are executed daily on dedicated local resources. The goal is to maximize productivity and resource utilization while still ensuring timely completion of all forecasts. Many existing workflow management systems address low-level workflow specification and execution challenges, but do not directly address the high-level challenges posed by large-scale data product factories. In this paper we discuss several specific challenges to managing the CORIE forecast factory, including planning and scheduling, improving data flow, and analyzing log data, and point out their analogs in the "physical" manufacturing world. We present solutions we have implemented to address these challenges, along with experimental results that show their benefits.
{"title":"Managing the Forecast Factory","authors":"Laura Bright, D. Maier, Bill Howe","doi":"10.1109/ICDEW.2006.76","DOIUrl":"https://doi.org/10.1109/ICDEW.2006.76","url":null,"abstract":"The CORIE forecast factory consists of a set of data product generation runs that are executed daily on dedicated local resources. The goal is to maximize productivity and resource utilization while still ensuring timely completion of all forecasts. Many existing workflow management systems address low-level workflow specification and execution challenges, but do not directly address the high-level challenges posed by large-scale data product factories. In this paper we discuss several specific challenges to managing the CORIE forecast factory including planning and scheduling, improving data flow, and analyzing log data, and point out their analogs in the \"physical\" manufacturing world. We present solutions we have implemented to address these challenges, and present experimental results that show the benefits of these solutions.","PeriodicalId":331953,"journal":{"name":"22nd International Conference on Data Engineering Workshops (ICDEW'06)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128097930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}