Interlinking Large-scale Library Data with Authority Records

Felix Bensmann, Benjamin Zapilko, Philipp Mayr
DOI: 10.3389/fdigh.2017.00005
Journal: Frontiers in Digital Humanities
Citations: 7

Abstract

In the area of Linked Open Data (LOD), meaningful, high-performance interlinking of different datasets has become an ongoing challenge. The necessary tasks are supported by established standards and software, e.g., for the transformation, storage, interlinking, and publication of data. Our use case, Swissbib, is a well-known provider of bibliographic data in Switzerland, representing various libraries and library networks. In this article, we present a case study from the project linked.swissbib.ch, which focuses on preparing and publishing the Swissbib data as LOD. Data available in MARC 21 XML is extracted from the Swissbib system and transformed into an RDF/XML representation. From the approximately 21 million monolithic records, the author information is extracted and interlinked with authority files from the Virtual International Authority File (VIAF) and DBpedia. The links are used to extract additional data from the counterpart corpora. Afterwards, the data is pushed into an Elasticsearch index to make it accessible to other components. As a demonstrator, a search portal was developed that presents the additional data and the generated links to users. In addition, a REST interface was developed to enable access by other applications as well. A main obstacle in this project is the amount of data and the necessity of day-to-day (partial) updates. In the current situation, the data in Swissbib and in the external corpora are too large to be processed by established linking tools: the resulting memory footprint prevents these tools from functioning correctly. Triple stores are also unwieldy, incurring massive overhead for import and update operations. Hence, we have developed procedures for extracting and shaping the data into a more suitable form; for example, the data is reduced to the necessary properties and blocked. For this purpose, we used sorted N-Triples as an intermediate data format.
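The extraction and transformation step described above — pulling author information out of MARC 21 XML records and emitting it as triples — might look roughly like the following minimal sketch. The record URI scheme, the use of control field 001 as the identifier, and the `foaf:name` predicate are illustrative assumptions, not the project's actual mapping.

```python
# Hypothetical sketch: extract main-entry author names from MARC 21 XML
# (data field 100, subfield a) and emit one simple N-Triple per record.
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def marc_to_ntriples(marc_xml: str) -> list[str]:
    """Return one N-Triple (record URI, foaf:name, author literal) per record."""
    triples = []
    root = ET.fromstring(marc_xml)
    for record in root.iter(f"{MARC_NS}record"):
        # Control field 001 holds the record identifier.
        rec_id = record.findtext(f"{MARC_NS}controlfield[@tag='001']")
        # Data field 100 $a holds the main-entry personal name.
        name = record.findtext(
            f"{MARC_NS}datafield[@tag='100']/{MARC_NS}subfield[@code='a']"
        )
        if rec_id and name:
            triples.append(
                f'<http://example.org/record/{rec_id}> '
                f'<http://xmlns.com/foaf/0.1/name> "{name.strip()}" .'
            )
    return triples

sample = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">123</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Einstein, Albert</subfield>
    </datafield>
  </record>
</collection>"""

print(marc_to_ntriples(sample)[0])
# → <http://example.org/record/123> <http://xmlns.com/foaf/0.1/name> "Einstein, Albert" .
```

The flat line-per-triple output is what makes the later sorting and blocking cheap: plain text files can be sorted and streamed without loading a graph into memory.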
As our preliminary results show, this method is very promising. Our approach established 30,773 links to DBpedia and 20,714 links to VIAF; both link sets show high precision and could be generated within a reasonable amount of time.
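The sorted-N-Triples idea behind these link sets can be sketched as a merge join: reduce each corpus to (blocking key, URI) pairs, sort by key, then stream both sorted lists in lockstep so memory use stays constant regardless of corpus size. The token-sort normalization and the `owl:sameAs` output below are illustrative assumptions, and the VIAF URI in the usage example is a made-up placeholder.

```python
# Hypothetical sketch of linking two name-sorted corpora via a merge join.

def normalize(name: str) -> str:
    """Blocking key: lowercase, drop punctuation, sort tokens so that
    "Einstein, Albert" and "Albert Einstein" produce the same key."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " "
                      for c in name.lower())
    return " ".join(sorted(cleaned.split()))

def merge_join_links(left: list[tuple[str, str]],
                     right: list[tuple[str, str]]) -> list[str]:
    """Stream two (name, uri) lists sorted by key; emit owl:sameAs links."""
    left = sorted((normalize(n), u) for n, u in left)
    right = sorted((normalize(n), u) for n, u in right)
    links, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Collect the full block of equal keys on each side, cross-link.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lu in left[i:i_end]:
                for _, ru in right[j:j_end]:
                    links.append(
                        f"<{lu}> <http://www.w3.org/2002/07/owl#sameAs> <{ru}> ."
                    )
            i, j = i_end, j_end
    return links

# Toy corpora; the VIAF identifier is a placeholder, not a real record.
swissbib = [("Einstein, Albert", "http://example.org/record/123")]
viaf = [("Albert Einstein", "http://viaf.org/viaf/0000001")]
print(merge_join_links(swissbib, viaf)[0])
# → <http://example.org/record/123> <http://www.w3.org/2002/07/owl#sameAs> <http://viaf.org/viaf/0000001> .
```

In practice the sorting would be done externally (e.g. on sorted N-Triples files) and the merge would stream from disk, which is what sidesteps the memory footprint that defeats off-the-shelf linking tools on 21 million records.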