Corpus-based schema matching

21st International Conference on Data Engineering (ICDE'05) | Pub Date: 2005-04-05 | DOI: 10.1109/ICDE.2005.39

J. Madhavan, P. Bernstein, A. Doan, A. Halevy


Citations: 435

Abstract

Schema matching is the problem of identifying corresponding elements in different schemas. Discovering these correspondences, or matches, is inherently difficult to automate. Past solutions have proposed a principled combination of multiple algorithms. However, these solutions sometimes perform rather poorly because the schemas being matched do not contain sufficient evidence. In this paper we show how a corpus of schemas and mappings can be used to augment the evidence about the schemas being matched, so that they can be matched better. Such a corpus typically contains multiple schemas that model similar concepts and hence enables us to learn variations in the elements and their properties. We exploit such a corpus in two ways. First, we increase the evidence about each element being matched by including evidence from similar elements in the corpus. Second, we learn statistics about elements and their relationships and use them to infer constraints that we use to prune candidate mappings. We also describe how to use known mappings to learn the importance of domain and generic constraints. We present experimental results that demonstrate that corpus-based matching outperforms direct matching (without the benefit of a corpus) in multiple domains.
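The first use of the corpus described in the abstract, adding evidence from similar corpus elements before comparing two schema elements, can be illustrated with a toy sketch. This is not the paper's algorithm: the corpus contents, the token-based Jaccard similarity, the 0.5 threshold, and all function names below are assumptions chosen only to make the idea concrete.

```python
# Toy illustration only: names, data, and the similarity measure are hypothetical.

def tokens(name):
    """Split an element name such as 'postal_code' into lowercase tokens."""
    return set(name.lower().replace("-", "_").split("_"))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical corpus: groups of element names that known mappings tie together.
CORPUS = [
    {"phone", "telephone", "contact_number"},
    {"zip", "postal_code", "postcode"},
]

def augment(name, threshold=0.5):
    """Return the element's own tokens plus tokens of similar corpus elements."""
    own = tokens(name)
    evidence = set(own)
    for group in CORPUS:
        # If any name in the group resembles this element, borrow the whole group.
        if any(jaccard(own, tokens(variant)) >= threshold for variant in group):
            for variant in group:
                evidence |= tokens(variant)
    return evidence

def direct_score(a, b):
    """Match score using only the evidence in the two schemas themselves."""
    return jaccard(tokens(a), tokens(b))

def corpus_score(a, b):
    """Match score after augmenting both elements with corpus evidence."""
    return jaccard(augment(a), augment(b))

if __name__ == "__main__":
    for pair in [("phone", "telephone"), ("zip", "postal_code"), ("phone", "zip")]:
        print(pair, direct_score(*pair), corpus_score(*pair))
```

In this sketch, "phone" and "telephone" share no tokens, so the direct score is zero, while the augmented score is high because both names pick up the same corpus group. The second use described in the abstract, learning statistics to infer constraints that prune candidate mappings, is not covered here.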
