The feasibility of investing in manual correction of metadata for a large-scale digital library

Hung-Hsuan Chen, Madian Khabsa, C. Lee Giles
Published in: Proceedings of the ... ACM/IEEE Joint Conference on Digital Libraries, pp. 225-228, 2014-09-01. DOI: 10.1109/JCDL.2014.6970172 (https://doi.org/10.1109/JCDL.2014.6970172)
Citations: 2

Abstract

Given a large-scale digital library that automatically crawls and parses PDF files to generate metadata for documents and authors, we estimate the number of person-hours required to correct a small portion of the metadata, in the hope that a large portion of users can benefit from these corrections. We obtain user requests by analyzing CiteSeerX's log files from September 2009 to March 2013. We find that the distribution of user search requests is highly imbalanced: most document search queries and author search queries concentrate on a small set of terms. As a result, even for a large-scale digital library, we estimate that it is affordable to invest a few person-hours in checking the correctness of a small number of metadata records, and thus to benefit a substantial share of document search and author search requests.
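The reasoning in the abstract (a skewed query distribution means a few corrected records serve many requests) can be illustrated with a minimal sketch. This is not the authors' method: the function name, the toy query log, and the figure of 5 minutes per record are all assumptions made purely for illustration.

```python
from collections import Counter


def terms_needed_for_coverage(queries, target_fraction):
    """Pick the most frequent query terms until they cover target_fraction of requests.

    Returns the selected terms and the fraction of requests they account for.
    """
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0

    selected = []
    covered = 0
    # Walk terms from most to least frequent, stopping once the target is met.
    for term, count in counts.most_common():
        if covered >= target_fraction * total:
            break
        selected.append(term)
        covered += count
    return selected, covered / total


# Toy, made-up query log (hypothetical data, not from CiteSeerX).
toy_log = ["deep learning"] * 6 + ["svm"] * 2 + ["lda", "pagerank"]
terms, coverage = terms_needed_for_coverage(toy_log, 0.8)

# Hypothetical cost model: 5 minutes of manual checking per selected term.
MINUTES_PER_RECORD = 5
person_hours = len(terms) * MINUTES_PER_RECORD / 60
```

On this toy log, two of the four distinct terms already account for 80% of the requests, so the estimated checking effort stays small even as the log grows, provided the distribution remains similarly skewed.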