Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool

A. J. Choi, Xuying Xin
Journal of eScience Librarianship. Published 2021-08-11. DOI: 10.7191/jeslib.2021.1209 (https://doi.org/10.7191/jeslib.2021.1209)

Abstract

Data curation is the process of managing data so that it is available for reuse and preservation and supports FAIR (findable, accessible, interoperable, reusable) use. It is an important part of the research lifecycle, as researchers are often either required by funders or generally encouraged to preserve their datasets and make them discoverable and reusable. This has become especially important as Open Access (OA) policies are implemented at many institutions across the nation. An efficient data repository and its data curation play key roles in facilitating research data discovery and making reuse easier. In this article, we briefly discuss the local institutional repository at Penn State University and the general data curation practices we adopt for deposited files and datasets. We then focus on a data analytics tool that has recently been applied to extract tabular data from PDF files. This enhances existing data curation practices by adding tabular data to deposits containing PDF files, where tables are often embedded and not easily reused.