Harmonised Data is Actionable Data: DiSSCo’s solution to data mapping
Sam Leeflang, W. Addink
Biodiversity Information Science and Standards (published 2023-09-06)
DOI: 10.3897/biss.7.112137
Citations: 0
Abstract
Predictability is one of the core requirements for creating machine-actionable data. The more predictable the data, the more generic the services acting on it can be; and the more generic the service, the easier it becomes to exchange ideas, collaborate on initiatives and let machines do the work. Predictability is essential for implementing the FAIR Principles (Findable, Accessible, Interoperable, Reusable), as it underpins the “I” for Interoperability (Jacobsen et al. 2020). The FAIR principles emphasise machine actionability because the amount of data generated is far too large for humans to handle.
While Biodiversity Information Standards (TDWG) standards have massively improved the standardisation of biodiversity data, there is still room for improvement. Within the Distributed System of Scientific Collections (DiSSCo), we aim to harmonise all scientific data derived from European specimen collections, including geological specimens, into a single data specification. We call this data specification the open Digital Specimen (openDS). It is being built on top of existing and developing biodiversity information standards such as Darwin Core (DwC), Minimum Information about a Digital Specimen (MIDS), Latimer Core, the Access to Biological Collection Data (ABCD) Schema with its Extension for Geosciences (EFG), and also on the new Global Biodiversity Information Facility (GBIF) Unified Model. In openDS we leverage the existing standards within the TDWG community but combine them with stricter constraints and controlled vocabularies, with the aim of improving the FAIRness of the data. This will not only make the data easier to use, but will also increase its quality and machine actionability.
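The idea of mapping several source standards onto one specification can be sketched as a simple term-renaming step. This is an illustrative sketch only: the term names below are invented placeholders, not the actual openDS, DwC or ABCD vocabularies.

```python
# Hypothetical term map: the same concept arrives under different keys
# depending on the source standard, and is renamed to one shared key.
TERM_MAP = {
    "dwc:scientificName": "ods:specimenName",             # Darwin Core (illustrative)
    "abcd:FullScientificNameString": "ods:specimenName",  # ABCD(EFG) (illustrative)
}

def harmonise_record(record: dict) -> dict:
    """Rename source-standard keys to the shared target key,
    passing unknown keys through unchanged."""
    return {TERM_MAP.get(key, key): value for key, value in record.items()}

dwc_record = {"dwc:scientificName": "Quercus robur L."}
abcd_record = {"abcd:FullScientificNameString": "Quercus robur L."}

# Both source records converge on the same harmonised key
print(harmonise_record(dwc_record))
print(harmonise_record(abcd_record))
```

Once every record uses the same keys, any downstream service can be written once against the harmonised keys instead of once per source standard.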
As a first step towards this, the harmonisation of terms, we ensure that similar values use the same standard term as their key. This enables the next step, in which we harmonise the values themselves by transforming free-text values into standardised or controlled vocabularies. For example, instead of using the names J. Doe, John Doe and J. Doe sr. for a collector, we aim to standardise these to J. Doe, with a person identifier that links this name to more information about the collector.
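The collector example above can be sketched as a small lookup from free-text variants to a standardised name plus identifier. The lookup table and the identifier URL are invented for illustration; in practice such a mapping would be backed by a person-identifier service rather than a hard-coded dictionary.

```python
# Illustrative value harmonisation: several free-text name variants
# collapse onto one standardised name and one (placeholder) identifier.
COLLECTOR_LOOKUP = {
    "J. Doe": ("J. Doe", "https://orcid.org/0000-0000-0000-0000"),
    "John Doe": ("J. Doe", "https://orcid.org/0000-0000-0000-0000"),
    "J. Doe sr.": ("J. Doe", "https://orcid.org/0000-0000-0000-0000"),
}

def harmonise_collector(raw_name: str) -> dict:
    """Map a free-text collector name to a standardised name and
    identifier, keeping the raw value when no mapping is known."""
    name, identifier = COLLECTOR_LOOKUP.get(raw_name, (raw_name, None))
    return {"name": name, "identifier": identifier}

for raw in ("J. Doe", "John Doe", "J. Doe sr."):
    print(harmonise_collector(raw))
```

All three variants now resolve to the same person record, so services can aggregate by identifier instead of guessing which name strings belong together.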
Biodiversity information standards such as DwC were developed to lower the bar for data sharing. The downside of minimal constraints and high flexibility is that they leave room for ambiguity, allowing multiple interpretations of the same data. This limits interoperability and hampers machine actionability. In DiSSCo, data will come from different sources that use different biodiversity information standards, so we need to harmonise terms between those standards. To complicate things further, different serialisation methods are used for data exchange: Darwin Core Archives (DwC-A; GBIF 2021) use comma-separated values (CSV) files, ABCD(EFG) exposed through the Biological Collection Access Service (BioCASe) uses XML, and most custom formats use JavaScript Object Notation (JSON).
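The serialisation problem can be illustrated with deliberately simplified inputs: the same specimen record arriving as CSV, XML and JSON, each parsed into a plain in-memory record before term harmonisation. The column and element names are invented examples, far simpler than real DwC-A or ABCD payloads.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def from_csv(text: str) -> dict:
    """Parse the first data row of a CSV (DwC-A style) into a dict."""
    return next(csv.DictReader(io.StringIO(text)))

def from_xml(text: str) -> dict:
    """Flatten a shallow XML unit (ABCD style) into a dict."""
    return {child.tag: child.text for child in ET.fromstring(text)}

def from_json(text: str) -> dict:
    """Parse a JSON record into a dict."""
    return json.loads(text)

csv_src = "catalogNumber,scientificName\nB-123,Quercus robur"
xml_src = ("<Unit><catalogNumber>B-123</catalogNumber>"
           "<scientificName>Quercus robur</scientificName></Unit>")
json_src = '{"catalogNumber": "B-123", "scientificName": "Quercus robur"}'

# All three serialisations converge on the same in-memory record
assert from_csv(csv_src) == from_xml(xml_src) == from_json(json_src)
```

After this normalisation step, the term and value harmonisation logic no longer needs to know which wire format a record arrived in.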
In this lightning talk, we will dive into DiSSCo’s technical implementation of the harmonisation process. DiSSCo currently supports two biodiversity information standards, DwC and ABCD(EFG), and maps the data to our openDS specification on a record-by-record basis. We will highlight some of the more problematic mappings, but also show how a harmonised model massively simplifies generic actions, such as the calculation of MIDS levels, which indicate the digitisation completeness of a specimen. We will conclude with a quick look at the next steps and hope to start a discussion about controlled vocabularies.
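A MIDS-style completeness check is a good example of a generic action that becomes trivial over a harmonised model: one function works for every record, regardless of its source standard. The required terms per level below are invented placeholders; the real MIDS specification defines the actual elements for each level.

```python
# Hypothetical per-level requirements over harmonised (openDS-style) keys.
MIDS_REQUIREMENTS = {
    1: {"ods:physicalSpecimenId", "ods:specimenName"},      # illustrative
    2: {"ods:collectingLocation", "ods:collectingDate"},    # illustrative
}

def mids_level(record: dict) -> int:
    """Return the highest level for which this level's required terms,
    and those of all lower levels, hold non-empty values."""
    level = 0
    for lvl in sorted(MIDS_REQUIREMENTS):
        if all(record.get(term) for term in MIDS_REQUIREMENTS[lvl]):
            level = lvl
        else:
            break
    return level

record = {"ods:physicalSpecimenId": "B-123", "ods:specimenName": "Quercus robur"}
print(mids_level(record))  # 1: identification terms present, collecting event missing
```

Because every record exposes the same keys after harmonisation, this single loop replaces one bespoke completeness check per source standard and serialisation format.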
The development of high-quality, standardised data, based on a strict specification with controlled vocabularies and rooted in community-accepted standards, can have a huge impact on biodiversity research and is an essential step towards scaling up research with computational support.