Finding pages on the unarchived Web

Hugo C. Huurdeman, Anat Ben-David, J. Kamps, Thaer Samar, A. D. Vries
{"title":"Finding pages on the unarchived Web","authors":"Hugo C. Huurdeman, Anat Ben-David, J. Kamps, Thaer Samar, A. D. Vries","doi":"10.1109/JCDL.2014.6970188","DOIUrl":null,"url":null,"abstract":"Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies-most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the DutchWeb archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.","PeriodicalId":92278,"journal":{"name":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","volume":"13 1","pages":"331-340"},"PeriodicalIF":0.0000,"publicationDate":"2014-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries. ACM/IEEE Joint Conference on Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCDL.2014.6970188","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies-most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the DutchWeb archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
在未存档的Web上查找页面
Web存档保存了快速变化的Web,但由于爬行限制、爬行深度和频率,或限制性选择策略,它们是高度不完整的——大多数Web未存档,因此丢失给后代。在本文中,我们提出了一种方法来恢复未存档的Web的重要部分,通过基于抓取页面集中的链接和锚重构这些页面的描述,并在DutchWeb存档上试验了这种方法。我们的主要发现有三个方面。首先,爬行的Web包含大量未存档的页面和网站的证据,这可能极大地增加了Web存档的覆盖范围。其次,链接和锚的描述有一个高度倾斜的分布:受欢迎的页面,如主页有更多的术语,但丰富程度很快减少。第三,简洁的表示通常足够丰富,可以唯一地标识未归档Web上的页面:在已知条目的搜索设置中,我们平均可以检索到排名靠前的页面。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Keynote 1: A Conversation with Dr. Safiya Noble Towards Knowledge Maintenance in Scientific Digital Libraries with the Keystone Framework. Identifying the Development Process of the Electronic Health Records Research from the Perspective of Information Resource Management The Status, Hot Topics in the Field of Electronic Health Records: A Literature Review Based on Lda2vec Keynote: Standards and Communities: Connected People, Consistent Data, Usable Applications
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1