Keeping linked open data caches up-to-date by predicting the life-time of RDF triples

Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics
Pub Date: 2017-08-23
DOI: 10.1145/3106426.3106463

Chifumi Nishioka, A. Scherp


Citations: 11

Abstract

Many Linked Open Data (LOD) applications require fresh copies of RDF data in their local repositories. Since RDF documents constantly change and those changes are not automatically propagated to the LOD applications, it is important to regularly visit the RDF documents to refresh the local copies and keep them up-to-date. For this purpose, crawling strategies determine which RDF documents should be preferentially fetched. Traditional crawling strategies rely only on how an RDF document has been modified in the past. In contrast, we predict at the triple level whether a change will occur in the future. We use the weekly snapshots of the DyLDO dataset as well as the monthly snapshots of the Wikidata dataset. First, we conduct an in-depth analysis of the life span of triples in RDF documents. Through the analysis, we identify which triples are stable and which are ephemeral. We introduce different features based on the triples and apply a simple but effective linear regression model. Second, we propose a novel crawling strategy based on the linear regression model. We evaluate it in two experimental setups: in one we vary the amount of available bandwidth, and in the other we iteratively observe the quality of the local copies over time. The results demonstrate that the novel crawling strategy outperforms the state of the art in both setups.
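The abstract does not specify the features, the regression target, or the ranking rule, so the following is only a minimal sketch of the general idea under stated assumptions: a one-feature linear regression predicts a triple's life span, and documents whose triples are predicted to be shortest-lived are fetched first, up to a bandwidth budget. The feature (number of snapshots a triple has survived so far), the toy data, and the names `fit_line`, `predict`, and `choose_documents` are all hypothetical and not taken from the paper.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def predict(model, x):
    """Predicted life span (in snapshots) for one triple's feature value."""
    a, b = model
    return a + b * x


def choose_documents(docs, model, budget):
    """Pick the `budget` documents with the shortest mean predicted life span.

    `docs` maps a document id to the feature values of its triples.
    Ranking by mean predicted life span is an assumption of this sketch.
    """
    scored = []
    for doc_id, feature_values in docs.items():
        score = mean(predict(model, x) for x in feature_values)
        scored.append((score, doc_id))
    scored.sort()  # shortest predicted life span first
    return [doc_id for _, doc_id in scored[:budget]]


# Hypothetical training data: feature = snapshots survived so far,
# target = total observed life span in snapshots.
train_x = [1, 2, 3, 4]
train_y = [2, 4, 6, 8]
model = fit_line(train_x, train_y)

# Hypothetical documents with per-triple feature values.
docs = {"doc_a": [1, 1, 2], "doc_b": [8, 9], "doc_c": [3, 4]}
to_fetch = choose_documents(docs, model, budget=2)
print(to_fetch)
```

Here `budget` stands in for the available bandwidth, expressed as a number of documents that can be fetched per crawl; a real crawler would likely weigh document size and use several features per triple.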


