The retrospective double-entry of a long-term ecological dataset
Simon Bull, Robert Sharrad, Michael G. Gardner
Ecological Informatics (IF 5.8, JCR Q1, Ecology), published 2024-11-02
DOI: 10.1016/j.ecoinf.2024.102873
https://www.sciencedirect.com/science/article/pii/S1574954124004151
Citations: 0
Abstract
Research data are almost always assumed to be reliable, but there are many reasons why data can be unreliable. Manual data-entry error rates are typically observed in the 1 to 4 % range and can be statistically impactful. This has encouraged techniques to mitigate the risk of transcription error, among which the double-entry method remains the most effective. Unfortunately, these techniques are rarely applied retrospectively to datasets collected years or decades ago, including to highly valued long-term ecological datasets that continue to contribute to active research.
This study defines an approach for the retrospective double-entry of long-term ecological datasets and then applies it to one such dataset: the 34-year (and counting) Mt Mary Lizard Survey. Software was used to execute comparisons of c.760,000 individual data value pairs across c.56,000 records to corroborate matching values and identify unmatched values.
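The value-pair comparison described above can be illustrated with a minimal Python sketch. It is not the software used in the study: the record layout, the field names (`lizard_id`, `site`, `mass_g`), the toy values, and the function `compare_entries` are all hypothetical, chosen only to show how matching values are corroborated and unmatched values are flagged.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Difference = Tuple[str, str, str, str]  # (record id, field, original, re-keyed)


def compare_entries(
    original: Dict[str, Record],
    rekeyed: Dict[str, Record],
    fields: List[str],
) -> Tuple[int, List[Difference]]:
    """Compare two keyed versions of the same records, field by field."""
    comparisons = 0
    differences: List[Difference] = []
    # Only records present in both versions can form value pairs.
    for record_id in sorted(original.keys() & rekeyed.keys()):
        for field in fields:
            a = original[record_id].get(field, "").strip()
            b = rekeyed[record_id].get(field, "").strip()
            comparisons += 1
            if a != b:
                differences.append((record_id, field, a, b))
    return comparisons, differences


# Hypothetical toy data: two records, three fields each.
FIELDS = ["lizard_id", "site", "mass_g"]
original = {
    "r1": {"lizard_id": "1234", "site": "A", "mass_g": "650"},
    "r2": {"lizard_id": "2071", "site": "B", "mass_g": "702"},
}
rekeyed = {
    "r1": {"lizard_id": "1234", "site": "A", "mass_g": "650"},
    "r2": {"lizard_id": "2017", "site": "B", "mass_g": "702"},  # transposed digits
}

comparisons, differences = compare_entries(original, rekeyed, FIELDS)
difference_rate = len(differences) / comparisons
```

A difference flagged this way only says that the two versions disagree. Deciding which value, if either, matches the original field sheet still requires checking the source record.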
The key findings are: a) from 760,967 value pair comparisons between the originally keyed dataset and a retrospectively re-keyed version of the same dataset, 18,637 differences (2.5 %) were detected, b) almost half (48 %) of the differences detected were intentional alterations made to the original dataset during data curation efforts, c) data differences were not uniformly distributed across data fields but concentrated in the animal identity data field, and d) a three-way comparison of the identity field corroborated a recorded value in almost all cases.
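One way to read the three-way comparison in finding d) is as a simple agreement rule: a value is treated as corroborated when at least two of three independent versions agree. The sketch below assumes that rule. The abstract does not specify how the third version was obtained or how agreement was decided, so the function `corroborated_value` and the example identity codes are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def corroborated_value(values: Sequence[str]) -> Optional[str]:
    """Return the value that at least two versions agree on, else None."""
    if not values:
        return None
    counts = Counter(v.strip() for v in values)
    value, count = counts.most_common(1)[0]
    return value if count >= 2 else None


# Hypothetical identity codes from three versions of the same record.
agreed = corroborated_value(["A12", "A12", "A21"])      # two of three match
unresolved = corroborated_value(["A12", "A21", "B3"])   # no two match
```

Under this rule, a record with no agreeing pair stays unresolved and would need to be checked against the original field record.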
Landmark, long-term ecological studies continue to be the evidentiary framework for ecological science. However, data quality metrics, including how faithfully digital transcriptions represent the originally recorded values, are rarely reported. Given that manual transcription errors are virtually assured, and that post hoc, intentional alterations during data curation are a realistic possibility, one could legitimately ask whether a manually transcribed and curated dataset is a genuine representation of the originally recorded values. The retrospective double-entry approach is one way to find out.
About the journal:
The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need to inform sustainable management in view of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.