The retrospective double-entry of a long-term ecological dataset

IF 5.8 | Zone 2 (Environmental Science and Ecology) | JCR Q1 (ECOLOGY) | Ecological Informatics | Pub Date: 2024-11-02 | DOI: 10.1016/j.ecoinf.2024.102873
Simon Bull, Robert Sharrad, Michael G. Gardner
Citation count: 0

Abstract

Research data are almost always assumed to be reliable, but there are many reasons why data can be unreliable. Manual data-entry error rates are typically observed in the 1 to 4% range and can be statistically impactful. This has encouraged the development of techniques to mitigate the risk of transcription error, among which the double-entry method remains the most effective. Unfortunately, these techniques are rarely applied retrospectively to datasets collected years or decades ago, including highly valued long-term ecological datasets that continue to contribute to active research.
This study defines an approach for the retrospective double-entry of long-term ecological datasets and then applies it to one such dataset: the 34-year (and counting) Mt Mary Lizard Survey. Software was used to compare c. 760,000 individual data value pairs across c. 56,000 records, corroborating matching values and identifying unmatched values.
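The value-pair comparison described above can be illustrated with a minimal sketch. It is not the software used in the study: the record identifiers, the field names (animal_id, length_mm) and the whitespace-trimming rule are all hypothetical, chosen only to show how two independently keyed versions of the same records can be compared field by field.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Mismatch = Tuple[str, str, str, str]  # (record id, field, original value, re-keyed value)


def compare_entries(
    original: Dict[str, Record],
    rekeyed: Dict[str, Record],
    fields: List[str],
) -> Tuple[int, List[Mismatch]]:
    """Compare every listed field of every record present in both versions.

    Returns the number of value pairs compared and the list of mismatches.
    """
    compared = 0
    mismatches: List[Mismatch] = []
    # Only records keyed in both versions can form value pairs.
    for record_id in sorted(original.keys() & rekeyed.keys()):
        for field in fields:
            # Assumption: surrounding whitespace is not a meaningful difference.
            a = original[record_id].get(field, "").strip()
            b = rekeyed[record_id].get(field, "").strip()
            compared += 1
            if a != b:
                mismatches.append((record_id, field, a, b))
    return compared, mismatches


# Hypothetical toy data: two records, two fields, one transposed-digit error.
original = {
    "r1": {"animal_id": "1042", "length_mm": "310"},
    "r2": {"animal_id": "2217", "length_mm": "295"},
}
rekeyed = {
    "r1": {"animal_id": "1042", "length_mm": "310"},
    "r2": {"animal_id": "2271", "length_mm": "295"},
}

compared, mismatches = compare_entries(original, rekeyed, ["animal_id", "length_mm"])
print(f"{compared} value pairs compared, {len(mismatches)} mismatches")
for record_id, field, a, b in mismatches:
    print(f"  {record_id}.{field}: original={a!r} rekeyed={b!r}")
```

A mismatch only says that the two versions disagree. Deciding which value is faithful to the original record still requires checking the source document.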
The key findings are: a) from 760,967 value pair comparisons between the originally keyed dataset and a retrospectively re-keyed version of the same dataset, 18,637 differences (2.5%) were detected, b) almost half (48%) of the differences detected were intentional alterations made to the original dataset during data curation efforts, c) data differences were not uniformly distributed across data fields but concentrated in the animal identity data field, and d) a three-way comparison of the identity field corroborated a recorded value in almost all cases.
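The abstract does not say what the third source in the three-way comparison of finding d) was, or how agreement was decided. As a hedged illustration only, one plausible rule is to treat a value as corroborated when at least two of the three versions agree. The function name and the two-of-three threshold below are assumptions, not the authors' method.

```python
from collections import Counter
from typing import Optional, Sequence


def corroborate(values: Sequence[str]) -> Optional[str]:
    """Return the value that at least two sources agree on, or None.

    Assumes `values` holds one non-empty entry per source (three here).
    """
    counts = Counter(v.strip() for v in values)
    value, n = counts.most_common(1)[0]
    return value if n >= 2 else None


# Hypothetical identity values from three independent keyings of one record.
print(corroborate(["2217", "2271", "2217"]))  # two sources agree on 2217
print(corroborate(["2217", "2271", "2127"]))  # no two sources agree
```

Records for which the function returns None would be the ones to resolve by hand against the original record.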
Landmark, long-term ecological studies continue to be the evidentiary framework for ecological science. However, data quality metrics, including how faithfully digital transcriptions represent the originally recorded values, are rarely reported. Given that manual transcription errors are virtually assured, and that post hoc, intentional alterations during data curation are a realistic possibility, one could legitimately ask whether a manually transcribed and curated dataset is a genuine representation of the originally recorded values. The retrospective double-entry approach is one way to find out.

Source journal
Ecological Informatics (Environmental Science, Ecology)
CiteScore: 8.30
Self-citation rate: 11.80%
Articles published: 346
Review time: 46 days
Journal description: The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need to inform sustainable management in view of global environmental and climate change. The journal is interdisciplinary, at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, and the modelling and prediction of ecological data.