Finding pages on the unarchived Web

Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries, pp. 331-340. Pub Date: 2014-09-08. DOI: 10.1109/JCDL.2014.6970188

Hugo C. Huurdeman, Anat Ben-David, J. Kamps, Thaer Samar, A. D. Vries


Citations: 16

Abstract

Web archives preserve the fast-changing Web, yet they are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies; as a result, most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web by reconstructing descriptions of these pages from the links and anchor text found in the set of crawled pages, and we experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting, we can retrieve these pages within the first ranks on average.
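The core idea of describing an unarchived page through the links that point to it can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(source_url, target_url, anchor_text)` tuple layout, the whitespace tokenization, and the term-frequency scoring are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_anchor_representations(links, archived_urls):
    """Aggregate anchor-text terms for link targets missing from the archive.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples
    extracted from crawled pages (hypothetical input format).
    """
    archived = set(archived_urls)
    representations = defaultdict(Counter)
    for _source_url, target_url, anchor_text in links:
        if target_url in archived:
            continue  # the page itself was crawled; nothing to reconstruct
        # Naive tokenization: lowercase and split on whitespace.
        representations[target_url].update(anchor_text.lower().split())
    return dict(representations)


def rank_known_item(query, representations):
    """Rank unarchived URLs by the summed frequency of the query terms.

    A deliberately simple scoring function standing in for a real
    retrieval model; ties are broken alphabetically by URL.
    """
    terms = query.lower().split()
    scores = {
        url: sum(term_counts[term] for term in terms)
        for url, term_counts in representations.items()
    }
    matching = [url for url, score in scores.items() if score > 0]
    return sorted(matching, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    links = [
        ("http://a.example/", "http://x.example/", "Example News"),
        ("http://b.example/", "http://x.example/", "news site"),
        ("http://a.example/", "http://y.example/", "weather"),
        ("http://a.example/", "http://b.example/", "b home"),
    ]
    archived = ["http://a.example/", "http://b.example/"]
    reps = build_anchor_representations(links, archived)
    print(rank_known_item("news", reps))
```

In this toy setting, a page that is linked often accumulates more anchor terms than a page linked once, which mirrors the skewed distribution the abstract describes; a known-item query then succeeds when the aggregated terms distinguish the target from other unarchived pages.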


