How to Assess the Exhaustiveness of Longitudinal Web Archives: A Case Study of the German Academic Web

Michael Paris, R. Jäschke
Published in: Proceedings of the 31st ACM Conference on Hypertext and Social Media, 2020-07-13
DOI: 10.1145/3372923.3404836 (https://doi.org/10.1145/3372923.3404836)
Citations: 4

Abstract

Longitudinal web archives can be a foundation for investigating structural and content-based research questions. One prerequisite is that they contain a faithful representation of the relevant subset of the web. Therefore, an assessment of the authority of a given dataset with respect to a research question should precede the actual investigation. Besides proper creation and curation, this requires measures for estimating the potential of a longitudinal web archive to yield information about the central objects the research question aims to investigate. In particular, content-based research questions often lack ab initio confidence in the integrity of the data. In this paper, we focus on one particularly important aspect, namely the exhaustiveness of the dataset with respect to the central objects. To this end, we investigate the recall coverage of researcher names in a longitudinal academic web crawl over a seven-year period and the influence of our crawl method on the dataset integrity. Additionally, we propose a method to estimate the amount of missing information as a means to describe the exhaustiveness of the crawl, and we motivate a use case for the presented corpus.
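
To make the notion of recall coverage concrete, the following is a minimal sketch, not the authors' method. It assumes a hypothetical reference set of researcher names and, for each crawl year, a hypothetical set of names detected in that year's snapshot. The function name `recall_per_year` and all data are invented for illustration only.

```python
from typing import Dict, Set


def recall_per_year(reference: Set[str],
                    found_by_year: Dict[int, Set[str]]) -> Dict[int, float]:
    """Return, per year, the fraction of reference names present in the snapshot."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return {
        year: len(reference & found) / len(reference)
        for year, found in sorted(found_by_year.items())
    }


# Toy, made-up data (assumption): not taken from the paper or its corpus.
reference_names = {"a. example", "b. sample", "c. placeholder", "d. dummy"}
names_in_crawl = {
    2013: {"a. example", "b. sample"},
    2014: {"a. example", "b. sample", "c. placeholder", "x. other"},
}

recall = recall_per_year(reference_names, names_in_crawl)
# Share of reference names NOT found in each snapshot; a crude stand-in
# for "missing information", assuming the reference set is complete.
missing = {year: 1.0 - value for year, value in recall.items()}

print(recall)
print(missing)
```

Names found in the crawl but absent from the reference set (here "x. other") do not affect recall. The measure only states how much of the reference set the snapshot covers, so it is only as reliable as the reference set itself.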