Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important problem that has yet to be well addressed. It can benefit many P2P applications, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model, distribution-free data density estimation, for dynamic ring-based P2P networks; it achieves high estimation accuracy at low estimation cost regardless of the distribution model of the underlying data. The model generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea of distribution-free estimation is to sample a small subset of peers to estimate the global data distribution over the data domain. We introduce algorithms for computing and sampling the global cumulative distribution function, from which the global data distribution is estimated, together with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.
{"title":"Effective Data Density Estimation in Ring-Based P2P Networks","authors":"Minqi Zhou, Heng Tao Shen, Xiaofang Zhou, Weining Qian, Aoying Zhou","doi":"10.1109/ICDE.2012.19","DOIUrl":"https://doi.org/10.1109/ICDE.2012.19","url":null,"abstract":"Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important issue and has yet to be well addressed. It can benefit many P2P applications, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate generation, in this paper we present a novel model named distribution-free data density estimation for dynamic ring-based P2P networks to achieve high estimation accuracy with low estimation cost regardless of distribution models of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. In P2P networks, the key idea for distribution-free estimation is to sample a small subset of peers for estimating the global data distribution over the data domain. Algorithms on computing and sampling the global cumulative distribution function based on which global data distribution is estimated are introduced with detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in ring-based P2P networks.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"8 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130356811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subgraph search is a popular query scenario on graph databases: given a query graph q, the subgraph search algorithm returns all database graphs having q as a subgraph. To implement subgraph search efficiently, subgraph features are mined in order to index the graph database. Many subgraph feature mining approaches have been proposed. They are all "mine-at-once" algorithms in which the whole feature set is mined in one run before building a stable graph index. However, due to changes in the environment (such as an update of the graph database or an increase in available memory), the index needs to be updated to accommodate such changes. Most "mine-at-once" algorithms involve frequent subgraph or subtree mining over the whole graph database, and constructing and deploying a new index involves expensive disk operations, so it is inefficient to re-mine the features and rebuild the index from scratch. We observe that, in most cases, it is sufficient to update a small part of the graph index. Here we propose an "iterative subgraph mining" algorithm that iteratively finds one feature to insert into (or remove from) the index. Since the majority of indexing features and the index structure remain unchanged, the algorithm can be invoked frequently. We define an objective function that guides the feature mining. Next, we propose a basic branch-and-bound algorithm to mine the features. Finally, we design an advanced search algorithm, which quickly finds a near-optimum subgraph feature and reduces the search space. Experiments show that our feature mining algorithm is 5 times faster than the popular graph indexing algorithm gIndex, and that features mined by our iterative algorithm have a better filtering rate for the subgraph search problem.
{"title":"Iterative Graph Feature Mining for Graph Indexing","authors":"Dayu Yuan, P. Mitra, Huiwen Yu, C. Lee Giles","doi":"10.1109/ICDE.2012.11","DOIUrl":"https://doi.org/10.1109/ICDE.2012.11","url":null,"abstract":"Sub graph search is a popular query scenario on graph databases. Given a query graph q, the sub graph search algorithm returns all database graphs having q as a sub graph. To efficiently implement a subgraph search, subgraph features are mined in order to index the graph database. Many subgraph feature mining approaches have been proposed. They are all \"mine-at-once\" algorithms in which the whole feature set is mined in one run before building a stable graph index. However, due to the change of environments (such as an update of the graph database and the increase of available memory), the index needs to be updated to accommodate such changes. Most of the \"mine-at-once\" algorithms involve frequent subgraph or subtree mining over the whole graph database. Also, constructing and deploying a new index involves an expensive disk operation such that it is inefficient to re-mine the features and rebuild the index from scratch. We observe that, under most cases, it is sufficient to update a small part of the graph index. Here we propose an \"iterative subgraph mining\" algorithm which iteratively finds one feature to insert into (or remove from) the index. Since the majority of indexing features and the index structure are not changed, the algorithm can be frequently invoked. We define an objective function that guides the feature mining. Next, we propose a basic branch and bound algorithm to mine the features. Finally, we design an advanced search algorithm, which quickly finds a near-optimum subgraph feature and reduces the search space. Experiments show that our feature mining algorithm is 5 times faster than the popular graph indexing algorithm gIndex, and that features mined by our iterative algorithm have a better filtering rate for the subgraph search problem.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"245 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132291337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many companies now routinely run massive data analysis jobs -- expressed in some scripting language -- on large clusters of low-end servers. Many analysis scripts are complex and contain common subexpressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts produces plans that execute a common subexpression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on column A and another partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascades-style optimizer to correctly optimize scripts containing common subexpressions. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21% to 57% lower estimated costs.
{"title":"Exploiting Common Subexpressions for Cloud Query Processing","authors":"Yasin N. Silva, P. Larson, Jingren Zhou","doi":"10.1109/ICDE.2012.106","DOIUrl":"https://doi.org/10.1109/ICDE.2012.106","url":null,"abstract":"Many companies now routinely run massive data analysis jobs -- expressed in some scripting language -- on large clusters of low-end servers. Many analysis scripts are complex and contain common sub expressions, that is, intermediate results that are subsequently joined and aggregated in multiple different ways. Applying conventional optimization techniques to such scripts will produce plans that execute a common sub expression multiple times, once for each consumer, which is clearly wasteful. Moreover, different consumers may have different physical requirements on the result: one consumer may want it partitioned on a column A and another one partitioned on column B. To find a truly optimal plan, the optimizer must trade off such conflicting requirements in a cost-based manner. In this paper we show how to extend a Cascade-style optimizer to correctly optimize scripts containing common sub expression. The approach has been prototyped in SCOPE, Microsoft's system for massive data analysis. Experimental analysis of both simple and large real-world scripts shows that the extended optimizer produces plans with 21 to 57% lower estimated costs.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123813189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Several efforts have been made to make Online Social Networks (OSNs) more privacy aware and to protect personal data against various privacy threats. However, despite the relevance of these proposals, we believe there is still a lack of a conceptual model on top of which privacy tools should be designed. Central to this model should be the concept of risk. Therefore, in this paper, we propose a risk measure for OSNs. The aim is to associate a risk level with social network users in order to provide other users with a measure of how risky, in terms of disclosure of private information, it might be to interact with them. We compute risk levels based on similarity and benefit measures, while also taking users' risk attitudes into account. In particular, we adopt an active learning approach for risk estimation, where a user's risk attitude is learned from a few required user interactions. The risk estimation process discussed in this paper has been implemented as a Facebook application and tested on real data. The experiments show the effectiveness of our proposal.
{"title":"Privacy in Social Networks: How Risky is Your Social Graph?","authors":"C. Akcora, B. Carminati, E. Ferrari","doi":"10.1109/ICDE.2012.99","DOIUrl":"https://doi.org/10.1109/ICDE.2012.99","url":null,"abstract":"Several efforts have been made for more privacy aware Online Social Networks (OSNs) to protect personal data against various privacy threats. However, despite the relevance of these proposals, we believe there is still the lack of a conceptual model on top of which privacy tools have to be designed. Central to this model should be the concept of risk. Therefore, in this paper, we propose a risk measure for OSNs. The aim is to associate a risk level with social network users in order to provide other users with a measure of how much it might be risky, in terms of disclosure of private information, to have interactions with them. We compute risk levels based on similarity and benefit measures, by also taking into account the user risk attitudes. In particular, we adopt an active learning approach for risk estimation, where user risk attitude is learned from few required user interactions. The risk estimation process discussed in this paper has been developed into a Facebook application and tested on real data. The experiments show the effectiveness of our proposal.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125260322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neuroscientists increasingly use computational tools to build and simulate models of the brain. The amounts of data involved in these simulations are immense, and managing this data efficiently is key. One particular problem in analyzing this data is the scalable execution of range queries on spatial models of the brain. Known indexing approaches do not perform well even on today's small models, which represent only a small fraction of the brain and contain just a few million densely packed spatial elements. The problem with current approaches is that as the level of detail in the models increases, the overlap in the tree structure also increases, ultimately slowing down query execution. The neuroscientists' need to work with bigger and more detailed (denser) models thus motivates us to develop a new indexing approach. To this end we develop FLAT, a scalable indexing approach for dense data sets. We base the development of FLAT on the key observation that current approaches suffer from overlap in the case of dense data sets. We hence design FLAT as an approach with two phases, each independent of density. In the first phase it uses a traditional spatial index to retrieve an initial object efficiently. In the second phase it traverses the initial object's neighborhood to retrieve the remaining query result. Our experimental results show that FLAT not only outperforms R-Tree variants by a factor of two to eight, but also achieves independence from data set size and density.
{"title":"Accelerating Range Queries for Brain Simulations","authors":"F. Tauheed, Laurynas Biveinis, T. Heinis, F. Schürmann, H. Markram, A. Ailamaki","doi":"10.1109/ICDE.2012.56","DOIUrl":"https://doi.org/10.1109/ICDE.2012.56","url":null,"abstract":"Neuroscientists increasingly use computational tools in building and simulating models of the brain. The amounts of data involved in these simulations are immense and efficiently managing this data is key. One particular problem in analyzing this data is the scalable execution of range queries on spatial models of the brain. Known indexing approaches do not perform well even on today's small models which represent a small fraction of the brain, containing only few millions of densely packed spatial elements. The problem of current approaches is that with the increasing level of detail in the models, also the overlap in the tree structure increases, ultimately slowing down query execution. The neuroscientists' need to work with bigger and more detailed (denser) models thus motivates us to develop a new indexing approach. To this end we develop FLAT, a scalable indexing approach for dense data sets. We base the development of FLAT on the key observation that current approaches suffer from overlap in case of dense data sets. We hence design FLAT as an approach with two phases, each independent of density. In the first phase it uses a traditional spatial index to retrieve an initial object efficiently. In the second phase it traverses the initial object's neighborhood to retrieve the remaining query result. Our experimental results show that FLAT not only outperforms R-Tree variants from a factor of two up to eight but that it also achieves independence from data set size and density.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121248881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms on the web, in social networks, and in other information networks. In most cases, the data is not available purely in text form: a lot of side-information accompanies the text documents. Such side-information may be of different kinds, such as the links in a document, user-access behavior from web logs, or other non-textual attributes embedded into the text document. These attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the clustering process, because it can either improve the quality of the representation for clustering or add noise to the process. Therefore, we need a principled way to perform the clustering so as to maximize the advantage of using this side-information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We present experimental results on a number of real data sets to illustrate the advantages of this approach.
{"title":"On Text Clustering with Side Information","authors":"C. Aggarwal, Yuchen Zhao, Philip S. Yu","doi":"10.1109/ICDE.2012.111","DOIUrl":"https://doi.org/10.1109/ICDE.2012.111","url":null,"abstract":"Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. In most cases, the data is not purely available in text form. A lot of side-information is available along with the text documents. Such side-information may be of different kinds, such as the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the clustering process, because it can either improve the quality of the representation for clustering, or can add noise to the process. Therefore, we need a principled way to perform the clustering process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"15 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125761350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have such a high volume that a pair-wise comparison of all records is infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. In addition to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
{"title":"Adaptive Windows for Duplicate Detection","authors":"Uwe Draisbach, Felix Naumann, Sascha Szott, Oliver Wonneberg","doi":"10.1109/ICDE.2012.20","DOIUrl":"https://doi.org/10.1109/ICDE.2012.20","url":null,"abstract":"Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity, respectively. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records and (ii) data sets might have a high volume making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data comparing only records that appear within the same window. We propose with the Duplicate Count Strategy (DCS) a variation of SNM that uses a varying window size. It is based on the intuition that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. Next to the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++ which is provably better than the original SNM in terms of efficiency (same results with fewer comparisons).","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115615179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
At the beginning of the Web 2.0 era, Online Social Networks (OSNs) appeared as just another phenomenon among wikis, blogs, video sharing, and so on. However, they soon became one of the biggest revolutions of the Internet era. Statistics confirm the continuing rise in the importance of social networking sites in terms of number of users (e.g., Facebook reaches 750 million users, Twitter 200 million, LinkedIn 100 million), time spent on social networking sites, and the amount of data flowing through them (e.g., Facebook users interact with about 900 million pieces of data in the form of pages, groups, events, and community pages). This successful trend makes OSNs one of the most promising paradigms for information sharing on the Web.
{"title":"Trust and Share: Trusted Information Sharing in Online Social Networks","authors":"B. Carminati, E. Ferrari, Jacopo Girardi","doi":"10.1109/ICDE.2012.127","DOIUrl":"https://doi.org/10.1109/ICDE.2012.127","url":null,"abstract":"At the beginning of Web 2.0 era, Online Social Networks (OSNs) appeared as just another phenomenon among wikis, blogs, video sharing, and so on. However, they soon became one of the biggest revolution of the Internet era. Statistics confirm the continuing rise in the importance of social networking sites in terms of number of users (e.g., Facebook reaches 750 millions users, Twitter 200 millions, LinkedIn 100 millions), time spent in social networking sites, and amount of data flowing (e.g., Facebook users interact with about 900 million piece of data in terms of pages, groups, events and community pages). This successful trend lets OSNs to be one of the most promising paradigms for information sharing on the Web.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128265206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A system for efficient keyword search in graphs is demonstrated. The system has two components: a search, restricted to the nodes containing the input keywords, for a set of nodes that are close to each other and together cover the input keywords; and an exploration that finds how these nodes are related to each other. The system generates all or top-k answers with polynomial delay. Answers are presented to the user according to a ranking criterion, so that answers whose nodes are closer to each other are presented before those whose nodes are farther apart. In addition, the set of answers produced by our system is duplication-free. The system uses two methods for presenting the final answer to the user; these presentation methods reveal the relationships among the nodes in an answer through either a tree or a multi-center graph, and we show that each method has its own advantages and disadvantages. The system is demonstrated on two challenging datasets, the very large DBLP and the highly cyclic Mondial. Challenges and difficulties in implementing an efficient keyword search system are also demonstrated.
{"title":"Efficient Top-k Keyword Search in Graphs with Polynomial Delay","authors":"M. Kargar, Aijun An","doi":"10.1109/ICDE.2012.124","DOIUrl":"https://doi.org/10.1109/ICDE.2012.124","url":null,"abstract":"A system for efficient keyword search in graphs is demonstrated. The system has two components, a search through only the nodes containing the input keywords for a set of nodes that are close to each other and together cover the input keywords and an exploration for finding how these nodes are related to each other. The system generates all or top-k answers in polynomial delay. Answers are presented to the user according to a ranking criterion so that the answers with nodes closer to each other are presented before the ones with nodes farther away from each other. In addition, the set of answers produced by our system is duplication free. The system uses two methods for presenting the final answer to the user. The presentation methods reveal relationships among the nodes in an answer through a tree or a multi-center graph. We will show that each method has its own advantages and disadvantages. The system is demonstrated using two challenging datasets, very large DBLP and highly cyclic Mondial. Challenges and difficulties in implementing an efficient keyword search system are also demonstrated.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117346790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment, and model management. To speed up the generation of such mappings, automatic matching systems have been developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires high manual effort from matching experts, as well as correct mappings against which generated mappings can be evaluated. We therefore propose a self-configuring schema matching system that automatically adapts to the mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology, and model management domains. The evaluation shows that our system robustly returns good-quality mappings across different mapping problems and domains.
{"title":"A Self-Configuring Schema Matching System","authors":"E. Peukert, Julian Eberius, E. Rahm","doi":"10.1109/ICDE.2012.21","DOIUrl":"https://doi.org/10.1109/ICDE.2012.21","url":null,"abstract":"Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment or model management. To speed up the generation of such mappings, automatic matching systems were developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort by matching experts as well as correct mappings to evaluate generated mappings. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the given mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology and model management domains. The evaluation shows that our system is able to robustly return good quality mappings across different mapping problems and domains.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"86 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132756283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}