The retrospective double-entry of a long-term ecological dataset

IF 5.8 | Zone 2, Environmental Science and Ecology | Q1 Ecology | Ecological Informatics | Pub Date: 2024-11-02 | DOI: 10.1016/j.ecoinf.2024.102873

Simon Bull, Robert Sharrad, Michael G. Gardner

Ecological Informatics, volume 84, Article 102873. https://www.sciencedirect.com/science/article/pii/S1574954124004151

Citations: 0

Abstract

Research data are almost always assumed to be reliable, but there are many reasons why data can be unreliable. Manual data-entry error rates typically fall in the 1 to 4% range and can be statistically impactful. This has encouraged techniques to mitigate the risk of transcription error, among which the double-entry method remains the most effective. Unfortunately, these techniques are rarely applied retrospectively to datasets collected years or decades ago, including highly valued long-term ecological datasets that continue to contribute to active research.

This study defines an approach for the retrospective double-entry of long-term ecological datasets and then applies it to one such dataset: the 34-year (and counting) Mt Mary Lizard Survey. Software was used to compare c. 760,000 individual data value pairs across c. 56,000 records, corroborating matching values and identifying unmatched values.
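The abstract does not name the software or describe its design, so the following is only a minimal sketch of what a field-by-field double-entry comparison could look like. The record layout, the field names (`animal_id`, `mass_g`, `sex`), and the functions `compare_entries` and `differences_by_field` are all hypothetical and not taken from the study.

```python
from collections import Counter


def compare_entries(original, rekeyed, fields):
    """Compare two keyings of the same records, one value pair at a time.

    original, rekeyed: dicts mapping a record id to a dict of field -> value.
    Returns (pairs_compared, differences), where each difference is a tuple
    (record_id, field, original_value, rekeyed_value).
    """
    compared = 0
    differences = []
    # Only records present in both keyings can be compared pairwise.
    for record_id in sorted(original.keys() & rekeyed.keys()):
        for field in fields:
            a = original[record_id].get(field)
            b = rekeyed[record_id].get(field)
            compared += 1
            if a != b:
                differences.append((record_id, field, a, b))
    return compared, differences


def differences_by_field(differences):
    """Tally differences per field to see where they concentrate."""
    return Counter(field for _, field, _, _ in differences)


# Hypothetical toy data: two records keyed twice, with one mismatch.
original = {
    1: {"animal_id": "A123", "mass_g": "512", "sex": "F"},
    2: {"animal_id": "B077", "mass_g": "430", "sex": "M"},
}
rekeyed = {
    1: {"animal_id": "A128", "mass_g": "512", "sex": "F"},
    2: {"animal_id": "B077", "mass_g": "430", "sex": "M"},
}

compared, diffs = compare_entries(original, rekeyed, ["animal_id", "mass_g", "sex"])
difference_rate = len(diffs) / compared
per_field = differences_by_field(diffs)
```

A detected difference is not necessarily an error in the original keying: it may equally be an error in the re-keyed version or a deliberate change, so each one still has to be checked against the source record.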

The key findings are: a) from 760,967 value pair comparisons between the originally keyed dataset and a retrospectively re-keyed version of the same dataset, 18,637 differences (2.5 %) were detected, b) almost half (48 %) of the differences detected were intentional alterations made to the original dataset during data curation efforts, c) data differences were not uniformly distributed across data fields but concentrated in the animal identity data field, and d) a three-way comparison of the identity field corroborated a recorded value in almost all cases.
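Finding (d) refers to a three-way comparison of the identity field. The abstract does not say which three sources were compared or what rule decided corroboration, so the sketch below assumes a simple rule: a value counts as corroborated when at least two of three independently obtained values agree. The function name `corroborate` and the sample identifiers are hypothetical.

```python
from collections import Counter


def corroborate(values):
    """Return the value shared by at least two of three sources, else None.

    Assumed rule for illustration only: agreement between any two of the
    three sources is treated as corroboration of that value.
    """
    if len(values) != 3:
        raise ValueError("expected exactly three values")
    value, count = Counter(values).most_common(1)[0]
    return value if count >= 2 else None


# Two sources agree, so their shared value is corroborated.
agreed = corroborate(["A123", "A128", "A123"])

# All three sources differ, so nothing is corroborated.
unresolved = corroborate(["A123", "A128", "A132"])
```

Under this assumed rule, cases that return `None` are the ones that would need manual review against the original field sheets.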

Landmark, long-term ecological studies continue to be the evidentiary framework for ecological science. However, data quality metrics, including how faithfully digital transcriptions represent the originally recorded values, are rarely reported. Given that manual transcription errors are virtually assured, and that post hoc, intentional alterations during data curation are a realistic possibility, one could legitimately ask whether a manually transcribed and curated dataset is a genuine representation of the originally recorded values. The retrospective double-entry approach is one way to find out.




Source journal

Ecological Informatics (Environmental Sciences, Ecology)

CiteScore: 8.30

Self-citation rate: 11.80%

Articles published: 346

Review time: 46 days

About the journal: The journal Ecological Informatics is devoted to the publication of high quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data as well as the critical need for informing sustainable management in view of global environmental and climate change. The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.