No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving
Authors: Ke Zhou, Claire Grover, Martin Klein, R. Tobin
Published in: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015-06-21
DOI: https://doi.org/10.1145/2756406.2756940
Citations: 10
Abstract
The citation of resources is a fundamental part of scholarly discourse. Due to the popularity of the web, there is an increasing trend for scholarly articles to reference web resources (e.g. software, data). However, due to the dynamic nature of the web, the referenced links may become inaccessible ('rotten') sometime after publication, returning a "404 Not Found" HTTP error. In this paper we first present some preliminary findings of a study of the persistence and availability of web resources referenced from papers in a large-scale scholarly repository. We reaffirm previous research that link rot is a serious problem in the scholarly world and that current web archives do not always preserve all rotten links. Therefore, a more pro-active archival solution needs to be developed to further preserve web content referenced in scholarly articles. To this end, we propose to apply machine learning techniques to train a link rot predictor for use by an archival framework to prioritise pro-active archiving of links that are more likely to be rotten. We demonstrate that we can obtain a fairly high link rot prediction AUC (0.72) with only a small set of features. By simulation, we also show that our prediction framework is more effective than current web archives for preserving links that are likely to be rotten. This work has a potential impact for the scholarly world where publishers can utilise this framework to prioritise the archiving of links for digital preservation, especially when there is a large quantity of links to be archived.
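To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a predictor has already assigned each link a rot score, and it shows (a) how an AUC such as the reported 0.72 can be read as the probability that a rotten link is scored above a live one, and (b) how a fixed archiving budget could be spent on the highest-scored links first. The function names `auc` and `prioritise`, the example URLs, and all scores and labels are hypothetical.

```python
from typing import List, Tuple


def auc(scores: List[float], labels: List[int]) -> float:
    """Pairwise-ranking AUC: the fraction of (rotten, live) pairs in which
    the rotten link (label 1) gets the higher score; ties count as half."""
    rotten = [s for s, y in zip(scores, labels) if y == 1]
    live = [s for s, y in zip(scores, labels) if y == 0]
    if not rotten or not live:
        raise ValueError("need at least one rotten and one live link")
    wins = 0.0
    for r in rotten:
        for l in live:
            if r > l:
                wins += 1.0
            elif r == l:
                wins += 0.5
    return wins / (len(rotten) * len(live))


def prioritise(links: List[Tuple[str, float]], budget: int) -> List[str]:
    """Return up to `budget` URLs, highest predicted rot score first."""
    ranked = sorted(links, key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:budget]]


# Hypothetical predictor output: (URL, predicted rot score).
predicted = [
    ("http://example.org/a", 0.9),
    ("http://example.org/b", 0.2),
    ("http://example.org/c", 0.6),
    ("http://example.org/d", 0.4),
]
# Hypothetical ground truth observed later: 1 = rotten, 0 = still live.
observed = [1, 0, 0, 1]

score = auc([s for _, s in predicted], observed)
queue = prioritise(predicted, budget=2)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so a ranking-based queue like `queue` is only as useful as the predictor's AUC is above 0.5.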