Simon Donig, Markus Eckl, S. Gassner, Malte Rehbein
Title: Web archive analytics: Blind spots and silences in distant readings of the archived web
Journal: Digital Scholarship in the Humanities
Publication date: 2023-04-19
DOI: 10.1093/llc/fqad014 (https://doi.org/10.1093/llc/fqad014)
Citations: 0
Abstract
In this article, we discuss epistemological and methodological aspects of web archive analytics, a recent development towards more data-centred access to web archives. More specifically, we suggest understanding both the process of archiving and the subsequent steps of analysis at scale as acts of observation that can be questioned for their epistemological a priori. Therefore, we propose the concepts of ‘blind spots’ (features of the live web not included in the archive upon its creation) and ‘silences’ (latent features present in the archive but requiring a particular method to be made articulate). In particular, we address two forms of silence that play a structural role in web archive analytics and are crucial to historians and social scientists alike: abundance (or scale) and time. We trace the epistemological implications of web archive analytics across an exemplary case study workflow and suggest methodological answers to the issues raised in this process. On the data extraction side, we introduce warc2corpus (w2c), a new tool for extracting granular, structured data, especially temporal information on the creation, modification, and publication of webpages. For data analysis, we demonstrate how distant reading techniques, more specifically structural topic modelling (STM), can contribute to providing a rich, temporally structured representation of textual web archive content that in turn can be subjected to scholarly inquiry, interpretation, and re-contextualization.
About the journal:
Digital Scholarship in the Humanities (DSH) is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the Humanities including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research and include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.