Progress with Repository-based Annotation Infrastructure for Biodiversity Applications

Biodiversity Information Science and Standards Pub Date : 2023-09-14 DOI:10.3897/biss.7.112707

Peter Cornwell

{"title":"Progress with Repository-based Annotation Infrastructure for Biodiversity Applications","authors":"Peter Cornwell","doi":"10.3897/biss.7.112707","DOIUrl":null,"url":null,"abstract":"Rapid development since the 1980s of technologies for analysing texts, has led not only to widespread employment of text 'mining', but also to now-pervasive large language model artificial intelligence (AI) applications. However, building new, concise, data resources from historic, as well as contemporary scientific literature, which can be employed efficiently at scale by automation and which have long-term value for the research community, has proved more elusive. Efforts at codifying analyses, such as the Text Encoding Initiative (TEI), date from the early 1990s and were initially driven by the social sciences and humanities (SSH) and linguistics communities, and extended with multiple XML-based tagging schemes, including in biodiversity (Miller et al. 2012). In 2010, the Bio-Ontologies Special Interest Group (of the International Society for Computational Biology) presented its Annotation Ontology (AO), incorporating JavaScript Object Notation and broadening previous XML-based approaches (Ciccarese et al. 2011). From 2011, the Open Annotation Data Model (OADM) (Sanderson et al. 2013) focused on cross-domain standards with utility for Web 3.0, leading to the W3C Web Annotation Data Model (WADM) Recommendation in February 2017*1 and the potential for unifying the multiplicity of already-in-use tagging approaches. This continual evolution has made the preservation of investment using annotation methods, and in particular of the connections between annotations and their context in source literature, particularly challenging. Infrastructure that entered service during the intervening years does not yet support WADM, and has only recently started to address the parallel emergence of page imagery-based standards such as the International Image Interoperability Framework (IIIF). Notably, IIIF instruments such as Mirador-2, which has been employed widely for manual creation and editing of annotations in SSH, continue to employ the now-deprecated OADM. Although multiple efforts now address combining IIIF and TEI text coordinate systems, they are currently fundamentally incompatible. However, emerging repository technologies enable preservation of annotation investment to be accomplished comprehensively for the first time. Native IIIF support enables interactive previewing of annotations within repository graphical user interfaces and dynamic serialisation technologies provide compatibility with existing XML-based infrastructures. Repository access controls can permit experts to trace annotation sources in original texts even if the literature is not publicly accessible, e.g., due to copyright restriction. This is of paramount importance, not only because surrounding context can be crucial to qualify formal terms that have been annotated, such as collecting country. Also, contemporary automated text mining—essential for operation at the scale of known biodiversity literature—is not 100% accurate and manual checking of uncertainties is currently essential. On-going improvement of language analysis tools through AI integration offers significant future gains from reprocessing literature and updating annotation data resources. Nevertheless, without effective preservation of digitized literature, as well as annotations, this enrichment will not be possible—and today's investments in gathering together, as well as analysing scientific literature will be devalued or lost. We report new functionality included in the InvenioRDM*2 Free and Open Source Software (FOSS) repository software platform, which natively supports IIIF and WADM. InvenioRDM development and maintenance is funded and managed by an international consortium. From late 2023, the InvenioRDM-based ZenodoRDM update*3 will display annotations on biodiversity literature interactively. Significantly, the Biodiversity Literature Repository (BLR) is a Zenodo Community. BLR automatically notifies the Global Biodiversity Information Facility (GBIF) of new taxonomic data and GBIF downloads and integrates this into its service. Moreover, an annotation service based on the WADM-native Mirador-3 FOSS IIIF viewer has now been developed and will enter service with ZenodoRDM. This enables editing of biodiversity annotations from within the repository interface, as well as automated updating of taxonomic information products provided to other major infrastructures such as GBIF. Two aspects of this ZenodoRDM annotation service are presented: dynamic transformation of (preservable) WADM annotations for consumption by contemporary IIIF-compliant applications such as Mirador-3, as well as for Plazi TreatmentBank/GBIF compatibility authentication and task organization permitting management of groups of expert contributors performing annotation enrichment tasks directly through the ZenodoRDM graphical user interface (GUI) dynamic transformation of (preservable) WADM annotations for consumption by contemporary IIIF-compliant applications such as Mirador-3, as well as for Plazi TreatmentBank/GBIF compatibility authentication and task organization permitting management of groups of expert contributors performing annotation enrichment tasks directly through the ZenodoRDM graphical user interface (GUI) Workflows for editing existing biodiversity annotations, as well as origination of new annotations, need to be tailored for specific tasks—e.g., unifying geographic collecting location definitions in historic reports—via configurable dialogs for contributors and controlled vocabularies. Selectively populating workflows with annotations according to a task definition is also important to avoid cluttering the editing GUI with non-essential information. Updated annotations are integrated into a new annotation collection upon completion of a task, before updating repository records. Current work on annotation workflows for SSH applications is also reported. The ZenodoRDM biodiversity annotation service implements a generic repository micro-service API, and the implementation of similar services for other repository software platforms is discussed.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodiversity Information Science and Standards","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3897/biss.7.112707","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Rapid development since the 1980s of technologies for analysing texts, has led not only to widespread employment of text 'mining', but also to now-pervasive large language model artificial intelligence (AI) applications. However, building new, concise, data resources from historic, as well as contemporary scientific literature, which can be employed efficiently at scale by automation and which have long-term value for the research community, has proved more elusive. Efforts at codifying analyses, such as the Text Encoding Initiative (TEI), date from the early 1990s and were initially driven by the social sciences and humanities (SSH) and linguistics communities, and extended with multiple XML-based tagging schemes, including in biodiversity (Miller et al. 2012). In 2010, the Bio-Ontologies Special Interest Group (of the International Society for Computational Biology) presented its Annotation Ontology (AO), incorporating JavaScript Object Notation and broadening previous XML-based approaches (Ciccarese et al. 2011). From 2011, the Open Annotation Data Model (OADM) (Sanderson et al. 2013) focused on cross-domain standards with utility for Web 3.0, leading to the W3C Web Annotation Data Model (WADM) Recommendation in February 2017*1 and the potential for unifying the multiplicity of already-in-use tagging approaches. This continual evolution has made the preservation of investment using annotation methods, and in particular of the connections between annotations and their context in source literature, particularly challenging. Infrastructure that entered service during the intervening years does not yet support WADM, and has only recently started to address the parallel emergence of page imagery-based standards such as the International Image Interoperability Framework (IIIF). Notably, IIIF instruments such as Mirador-2, which has been employed widely for manual creation and editing of annotations in SSH, continue to employ the now-deprecated OADM. Although multiple efforts now address combining IIIF and TEI text coordinate systems, they are currently fundamentally incompatible. However, emerging repository technologies enable preservation of annotation investment to be accomplished comprehensively for the first time. Native IIIF support enables interactive previewing of annotations within repository graphical user interfaces and dynamic serialisation technologies provide compatibility with existing XML-based infrastructures. Repository access controls can permit experts to trace annotation sources in original texts even if the literature is not publicly accessible, e.g., due to copyright restriction. This is of paramount importance, not only because surrounding context can be crucial to qualify formal terms that have been annotated, such as collecting country. Also, contemporary automated text mining—essential for operation at the scale of known biodiversity literature—is not 100% accurate and manual checking of uncertainties is currently essential. On-going improvement of language analysis tools through AI integration offers significant future gains from reprocessing literature and updating annotation data resources. Nevertheless, without effective preservation of digitized literature, as well as annotations, this enrichment will not be possible—and today's investments in gathering together, as well as analysing scientific literature will be devalued or lost. We report new functionality included in the InvenioRDM*2 Free and Open Source Software (FOSS) repository software platform, which natively supports IIIF and WADM. InvenioRDM development and maintenance is funded and managed by an international consortium. From late 2023, the InvenioRDM-based ZenodoRDM update*3 will display annotations on biodiversity literature interactively. Significantly, the Biodiversity Literature Repository (BLR) is a Zenodo Community. BLR automatically notifies the Global Biodiversity Information Facility (GBIF) of new taxonomic data and GBIF downloads and integrates this into its service. Moreover, an annotation service based on the WADM-native Mirador-3 FOSS IIIF viewer has now been developed and will enter service with ZenodoRDM. This enables editing of biodiversity annotations from within the repository interface, as well as automated updating of taxonomic information products provided to other major infrastructures such as GBIF. Two aspects of this ZenodoRDM annotation service are presented: dynamic transformation of (preservable) WADM annotations for consumption by contemporary IIIF-compliant applications such as Mirador-3, as well as for Plazi TreatmentBank/GBIF compatibility authentication and task organization permitting management of groups of expert contributors performing annotation enrichment tasks directly through the ZenodoRDM graphical user interface (GUI) dynamic transformation of (preservable) WADM annotations for consumption by contemporary IIIF-compliant applications such as Mirador-3, as well as for Plazi TreatmentBank/GBIF compatibility authentication and task organization permitting management of groups of expert contributors performing annotation enrichment tasks directly through the ZenodoRDM graphical user interface (GUI) Workflows for editing existing biodiversity annotations, as well as origination of new annotations, need to be tailored for specific tasks—e.g., unifying geographic collecting location definitions in historic reports—via configurable dialogs for contributors and controlled vocabularies. Selectively populating workflows with annotations according to a task definition is also important to avoid cluttering the editing GUI with non-essential information. Updated annotations are integrated into a new annotation collection upon completion of a task, before updating repository records. Current work on annotation workflows for SSH applications is also reported. The ZenodoRDM biodiversity annotation service implements a generic repository micro-service API, and the implementation of similar services for other repository software platforms is discussed.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于知识库的生物多样性标注基础设施研究进展

介绍了ZenodoRDM注释服务的两个方面:(可保存的)WADM注释的动态转换，以供当代符合iiif的应用程序(如Mirador-3)使用;以及Plazi TreatmentBank/GBIF兼容性认证和任务组织，允许管理专家贡献者组，直接通过ZenodoRDM图形用户界面(GUI)动态转换(可保存的)WADM注释，以供当代iiif兼容的应用程序(如Mirador-3)使用。以及Plazi TreatmentBank/GBIF兼容性认证和任务组织，允许管理专家贡献者组直接通过ZenodoRDM图形用户界面(GUI)执行注释丰富任务。编辑现有生物多样性注释的工作流程，以及创建新注释的工作流程，需要针对特定任务进行定制，例如:在历史报告中统一地理收集位置定义——通过为贡献者和受控词汇表提供的可配置对话框。根据任务定义有选择地填充带有注释的工作流，这对于避免编辑GUI与非必要信息混淆也很重要。更新后的注释在任务完成后集成到新的注释集合中，然后再更新存储库记录。还报告了SSH应用程序注释工作流的当前工作。ZenodoRDM生物多样性标注服务实现了一个通用的存储库微服务API，并讨论了其他存储库软件平台上类似服务的实现。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Biodiversity Information Science and Standards

自引率

0.00%

发文量