Publish-time data integration for open data platforms

Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele, Wolfgang Lehner
Published in: International Workshop on Open Data, 2013-06-03
DOI: 10.1145/2500410.2500413 (https://doi.org/10.1145/2500410.2500413)
Citations: 6

Abstract

Platforms for publication and collaborative management of data, such as Data.gov or Google Fusion Tables, are a new trend on the web. They manage very large corpora of datasets, but often lack an integrated schema, ontology, or even just common publication standards. This results in inconsistent names for attributes with the same meaning, which hinders the discovery of relationships between datasets and limits their reusability. Existing data integration techniques focus on reuse-time, i.e., they are applied when a user wants to combine a specific set of datasets or integrate them with an existing database. In contrast, this paper investigates a novel method of data integration at publish-time, where the publisher is provided with suggestions on how to integrate the new dataset with the corpus as a whole, without resorting to a manually created mediated schema or ontology for the platform. We propose data-driven algorithms that suggest alternative attribute names for a newly published dataset based on attribute and instance statistics maintained on the corpus. We evaluate the proposed algorithms using real-world corpora based on the open data platform opendata.socrata.com and relational data extracted from Wikipedia. We report on the system's response time and on the results of an extensive crowdsourcing-based evaluation of the quality of the generated alternative attribute names.
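To make the idea of publish-time name suggestions concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes the corpus keeps, for each attribute name, the set of values seen under that name (instance statistics) and the number of datasets using it (attribute statistics), and it ranks candidate names for a new column by Jaccard overlap of values. The class `CorpusStats`, the method names, and the parameters `k` and `min_overlap` are all hypothetical.

```python
from collections import defaultdict


class CorpusStats:
    """Hypothetical corpus-level statistics, keyed by normalized attribute name."""

    def __init__(self):
        # Instance statistics: values seen under each attribute name.
        self.values = defaultdict(set)
        # Attribute statistics: number of datasets that use each name.
        self.dataset_count = defaultdict(int)

    def add_dataset(self, columns):
        """Register a dataset given as {attribute name: list of cell values}."""
        for name, cells in columns.items():
            key = name.strip().lower()
            self.values[key].update(str(c).strip().lower() for c in cells)
            self.dataset_count[key] += 1

    def suggest_names(self, name, cells, k=3, min_overlap=0.1):
        """Return up to k existing names whose values overlap the new column."""
        new_values = {str(c).strip().lower() for c in cells}
        own = name.strip().lower()
        scored = []
        for candidate, seen in self.values.items():
            if candidate == own or not new_values:
                continue
            # Jaccard similarity between the new column and the corpus values.
            jaccard = len(new_values & seen) / len(new_values | seen)
            if jaccard >= min_overlap:
                scored.append((jaccard, self.dataset_count[candidate], candidate))
        # Highest overlap first; more widely used names break ties.
        scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
        return [candidate for _, _, candidate in scored[:k]]


corpus = CorpusStats()
corpus.add_dataset({"country": ["Germany", "France", "Italy"],
                    "population": [83, 67, 59]})
corpus.add_dataset({"country": ["Germany", "Spain"],
                    "capital": ["Berlin", "Madrid"]})

# A publisher uploads a column called "nation"; the corpus suggests "country".
suggestions = corpus.suggest_names("nation", ["Germany", "France", "Spain"])
print(suggestions)
```

Value overlap is only one possible signal; a fuller system would presumably combine it with name co-occurrence statistics and keep the per-name value sets in a compact summary, since scanning every name in a large corpus would affect the response time the abstract mentions.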