Authors: Guido Sautter, D. Agosti
Journal: Biodiversity Information Science and Standards
Published: 2023-09-08
DOI: 10.3897/biss.7.112344 (https://doi.org/10.3897/biss.7.112344)
Bidirectional Linking: Benefits, challenges, pitfalls, and solutions
Taxonomy, and biodiversity science in general, mainly revolve around four types of entities, which are available digitally in ever-increasing numbers from different services: (1) Physical specimens (kept in museums and other collections around the world) and observations are available digitally via the Global Biodiversity Information Facility (GBIF). (2) DNA sequences (often derived from preserved specimens) are available from the European Nucleotide Archive (ENA) and the National Center for Biotechnology Information (NCBI), with accession numbers as their primary means of citation. (3) Taxa, identified by taxon names, are increasingly registered in nomenclatural reference databases (ZooBank, International Plant Names Index (IPNI)) and aggregated in the Catalogue of Life (CoL). (4) Taxonomic treatments combine the former three: they define taxa, express scientific opinions about existing taxa based upon specimens as well as DNA sequences derived from them, and coin respective names; they are available from TreatmentBank (as well as from Zenodo/Biodiversity Literature Repository (BLR), Swiss Institute of Bioinformatics Literature Services (SIBiLS), and GBIF).
Traditionally, treatments cite specimens, taxa, and other treatments in mainly human-centric ways that describe where to find the cited object but are not immediately actionable in a digital sense. Specimen citations use institution codes, collection codes, and catalog numbers (often combined with geographical and environmental data). Taxon names are a type of self-citing entity, especially when given in combination with their (bibliographic) authorship, as they represent a historical approach to human-readable taxon identifiers. Citations of treatments are very similar to those of taxon names, adding (bibliographic) information about subsequent name usages as needed. Accession numbers for DNA sequences come closest to modern digital identifiers. However, none of these means of citation, as usually found in the literature, is readily machine-actionable, which makes them hard to process at scale and analyze programmatically. Identifiers coined by the various data providers, in combination with APIs to resolve them, alleviate this problem and enable computational navigation of such links. However, this alone only defers the problem: wherever the traditional means of citation occur in data, actionable identifiers (e.g., HTTP URIs) still need to be inferred at some point from the information they give.
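To make the inference step concrete, the following is a minimal sketch of turning a human-readable specimen citation into a structured key and then into a candidate HTTP URI. It is only an illustration under stated assumptions: the citation syntax accepted by the pattern, the `SpecimenKey` type, both function names, and the `https://example.org/specimen` base URL are hypothetical and do not describe any provider's actual scheme.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import quote


class SpecimenKey(NamedTuple):
    """Structured form of a specimen citation (hypothetical type)."""
    institution_code: str
    collection_code: Optional[str]
    catalog_number: str


# Assumed citation shape: INSTITUTION [COLLECTION] CATALOG, with the parts
# separated by spaces, colons, or hyphens. Real citations vary far more.
_CITATION = re.compile(
    r"(?P<inst>[A-Z]{2,10})[\s:-]+"
    r"(?:(?P<coll>[A-Za-z]+)[\s:-]+)?"
    r"(?P<cat>[A-Za-z]*\d[\w.-]*)"
)


def parse_specimen_citation(text: str) -> Optional[SpecimenKey]:
    """Return a structured key, or None if the text does not fit the pattern."""
    match = _CITATION.fullmatch(text.strip())
    if match is None:
        return None
    return SpecimenKey(match["inst"], match["coll"], match["cat"])


def candidate_uri(key: SpecimenKey,
                  base: str = "https://example.org/specimen") -> str:
    """Build a candidate URI; the base URL is a placeholder, not a real service."""
    parts = [key.institution_code, key.collection_code, key.catalog_number]
    return "/".join([base] + [quote(p, safe="") for p in parts if p])


if __name__ == "__main__":
    for citation in ("NHMUK:Ent:1234567", "MNHN 2010-123", "holotype, male"):
        key = parse_specimen_citation(citation)
        print(citation, "->", candidate_uri(key) if key else "not parseable")
```

A candidate built this way would still need to be checked against the provider's records before being stored as a link, since the same string can fit more than one interpretation.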
Recent projects, like BiCIKL, aim to add machine-navigable links to the various entities (or respective data records) at scale, in pursuit of (ideally) fully intermeshed records, connecting (1) treatments to subject taxon names and concepts, cited specimens and DNA sequences, as well as cited treatments (with explicit nomenclatural implications, e.g., taxon name synonymies or rebuttals thereof), (2) (digital) specimens to assigned taxon names, citing treatments, and any derived DNA sequences, (3) DNA sequences to source specimens (or their digital counterparts) where applicable, assigned taxon names, and citing treatments, and (4) taxon names to defining and synonymizing treatments, associated (digital) specimens, and any derived DNA sequences. Such direct links remove possible issues with transitive dependencies, where an intermediate record in a sequence of links becomes a point of failure. All major data providers have been doing this to various degrees for some time, which provides a great starting point, but several challenges and pitfalls remain. For valid technical reasons, the systems of the individual data providers are (and need to be) self-contained, which comes at the cost of a certain amount of duplication (e.g., GBIF and ENA/NCBI backbone taxonomies). This is unproblematic per se, but slows down the propagation of updates and can introduce some discrepancies.
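The idea of fully intermeshed records can be illustrated with a small in-memory sketch. The `LinkStore` class and the record identifiers below are hypothetical and stand in for whatever each provider actually stores; the point is only that every link is recorded in both directions and that related records are linked directly rather than via an intermediate record.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


class LinkStore:
    """Toy bidirectional link store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._links: Dict[str, Set[str]] = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        """Record the link in both directions."""
        self._links[a].add(b)
        self._links[b].add(a)

    def link_all(self, record_ids: Iterable[str]) -> None:
        """Fully intermesh a group of related records."""
        for a, b in combinations(list(record_ids), 2):
            self.link(a, b)

    def neighbors(self, record_id: str) -> Set[str]:
        """Return a copy of the records directly linked to record_id."""
        return set(self._links.get(record_id, ()))


if __name__ == "__main__":
    store = LinkStore()
    # Placeholder identifiers; real ones would be provider-issued.
    store.link_all(["treatment:T1", "taxon:N1", "specimen:S1", "sequence:Q1"])
    # The sequence reaches its source specimen without going through the
    # treatment, so an unavailable treatment record does not break the path.
    print(sorted(store.neighbors("sequence:Q1")))
```

Because the specimen and the sequence are linked directly, a temporarily unavailable treatment record does not make either of them unreachable from the other.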
Further, traditional human-readable identifiers can be somewhat ambiguous: (1) some institution and collection codes are not unique, or authors use them in non-standard ways (some codes in the Global Registry of Scientific Collections (GrSciColl) point to half a dozen different institutions, for instance); (2) certain catalog numbers of museum specimens are also valid (resolvable) accession numbers, with the actual semantics only emerging from context; (3) the absence of such context renders the semantics of data presented in tables especially hard to infer; (4) none of the providers has complete data coverage, so linking is not even technically possible in all cases at any given point, and some links can only be added over time, as coverage, and thus overlap between data sets, increases (newly published names cannot possibly be in CoL when the defining treatment gets digitized, for instance); (5) occasional full re-computation or re-processing is impractical and wasteful at best.
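One way to cope with points (1), (4), and (5) is to keep unresolved and ambiguous citations aside and revisit only those, rather than re-processing everything. The sketch below is an assumption-laden illustration, not a description of any existing system: `IncrementalLinker`, its `index` mapping (a stand-in for a provider lookup), and all keys and identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class IncrementalLinker:
    """Links citation keys to identifiers, deferring what cannot be linked yet."""

    # Stand-in for a provider lookup: citation key -> candidate identifiers.
    index: Dict[str, List[str]]
    links: Dict[str, str] = field(default_factory=dict)
    ambiguous: Dict[str, List[str]] = field(default_factory=dict)
    pending: Set[str] = field(default_factory=set)

    def try_link(self, key: str) -> str:
        """Return 'linked', 'ambiguous', or 'pending' for the given key."""
        candidates = self.index.get(key, [])
        if len(candidates) == 1:
            self.links[key] = candidates[0]
            self.pending.discard(key)
            return "linked"
        if candidates:
            # Several candidates: keep them for context-based disambiguation.
            self.ambiguous[key] = list(candidates)
            self.pending.discard(key)
            return "ambiguous"
        # No coverage yet: remember the key and try again later.
        self.pending.add(key)
        return "pending"

    def retry_pending(self) -> List[str]:
        """Revisit only the deferred keys; return those that are now linked."""
        return [k for k in sorted(self.pending) if self.try_link(k) == "linked"]


if __name__ == "__main__":
    index = {"Aus bus Smith, 2020": ["name:1"], "ABC": ["inst:1", "inst:2"]}
    linker = IncrementalLinker(index)
    for key in ("Aus bus Smith, 2020", "ABC", "Cus dus Doe, 2023"):
        print(key, "->", linker.try_link(key))
    index["Cus dus Doe, 2023"] = ["name:2"]  # coverage grows over time
    print("newly linked:", linker.retry_pending())
```

The design choice here is that a retry touches only the deferred keys, so the cost of catching up scales with what is still unlinked rather than with the full corpus.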
In this presentation, we discuss various ways of overcoming the outlined challenges and avoiding the described pitfalls, and also make related suggestions for APIs to better support respective mechanisms.