To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages

Impact Factor: 4.1; Zone 4 (Computer Science); JCR Q2 (Computer Science, Information Systems); ACM Transactions on the Web; Pub Date: 2023-03-27; DOI: 10.1145/3589206

John A. Berlin, Mat Kelly, Michael L. Nelson, M. Weigle


Citations: 1

Abstract

When replaying an archived web page, or memento, the fundamental expectation is that the page should be viewable and function exactly as it did at archival time. However, meeting this expectation requires web archives to modify the page and its embedded resources upon replay so that all resources and links reference the archive rather than the original server. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. The process of replaying mementos and the modifications made to the representations vary between archives. Because of this, there is no standard terminology for describing the replay and the needed modifications. In this paper, we propose terminology for describing the existing styles of replay and the modifications that web archives make to mementos to facilitate replay. Because of issues discovered with server-side-only modifications, we propose a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increase the cumulative number of requests made by 32.8%. We were also able to replay mementos that were previously not replayable from the Internet Archive. Many of the client-side rewriting ideas described in this work have been implemented in Wombat, a client-side URL rewriting system that is used by the Webrecorder, Pywb, and Wayback Machine playback systems.
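
The modification the abstract describes, making all resources and links reference the archive rather than the original server, can be illustrated with a minimal sketch. This is not the paper's framework or Wombat's implementation; it is a simplified static pass written for illustration. The archive prefix, the timestamp-in-path URL layout, and the function names `rewrite_url` and `rewrite_html` are assumptions, and the regular expression only handles double-quoted `href` and `src` attributes.

```python
import re
from urllib.parse import urljoin

# Assumed archive URL layout: <prefix>/<timestamp>/<original URL>.
# The prefix is illustrative; other archives use different layouts.
ARCHIVE_PREFIX = "https://web.archive.org/web"

# Matches href="..." or src="..." with double-quoted values only
# (a deliberate simplification; real HTML needs a proper parser).
_ATTR_RE = re.compile(r'\b(href|src)="([^"]*)"')


def rewrite_url(url: str, page_url: str, timestamp: str) -> str:
    """Return a version of `url` (as found on `page_url`) that points at the archive."""
    if url.startswith(("#", "data:", "javascript:", "mailto:")):
        return url  # nothing here is fetched from the original server
    absolute = urljoin(page_url, url)  # resolve relative links first
    if absolute.startswith(ARCHIVE_PREFIX + "/"):
        return absolute  # already references the archive
    return f"{ARCHIVE_PREFIX}/{timestamp}/{absolute}"


def rewrite_html(html: str, page_url: str, timestamp: str) -> str:
    """Rewrite every double-quoted href/src attribute value in `html`."""
    def _sub(match: re.Match) -> str:
        attr, value = match.group(1), match.group(2)
        return f'{attr}="{rewrite_url(value, page_url, timestamp)}"'

    return _ATTR_RE.sub(_sub, html)


if __name__ == "__main__":
    page = '<a href="/about">About</a><img src="logo.png">'
    print(rewrite_html(page, "http://example.com/", "20230327000000"))
```

A static pass like this only sees URLs present in the markup. URLs that scripts construct at runtime are invisible to it, which is one commonly cited reason for supplementing server-side rewriting with a client-side rewriter.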




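Source journal

ACM Transactions on the Web (Engineering & Technology: Computer Science, Software Engineering)

CiteScore: 4.90

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 7.5 months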

About the journal: Transactions on the Web (TWEB) is a journal publishing refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies. Topics in the scope of TWEB include but are not limited to the following: Browsers and Web Interfaces; Electronic Commerce; Electronic Publishing; Hypertext and Hypermedia; Semantic Web; Web Engineering; Web Services and Service-Oriented Computing; and XML. In addition, papers addressing the intersection of the following broader technologies with the Web are also in scope: Accessibility; Business Services; Education; Knowledge Management and Representation; Mobility and Pervasive Computing; Performance and Scalability; Recommender Systems; Searching, Indexing, Classification, Retrieval and Querying, Data Mining and Analysis; Security and Privacy; and User Interfaces. Papers discussing specific Web technologies, applications, content generation and management, and use are within scope. Papers describing novel applications of the Web, as well as papers on the underlying technologies, are also welcome.