Meeting Report for the Phenoscape TraitFest 2023 with Comments on Organising Interdisciplinary Meetings
Jennifer C. Girón Duque, Meghan Balk, W. Dahdul, H. Lapp, István Mikó, Elie Alhajjar, Brenen M. Wynd, Sergei Tarasov, Christopher Lawrence, Basanta Khakurel, Arthur Porto, Lin Yan, Isadora E Fluck, D. Porto, Joseph Keating, I. Borokini, Katja Seltmann, G. Montanaro, Paula M. Mabee
doi: 10.3897/biss.8.115232
The Phenoscape project has developed ontology-based tools and a knowledge base that enable the integration and discovery of phenotypes across species from the scientific literature. The Phenoscape TraitFest 2023 event aimed to promote innovative applications that adopt the capabilities supported by the data in the Phenoscape Knowledgebase and its corresponding semantics-enabled tools, algorithms and infrastructure. The event brought together 26 participants, including domain experts in biodiversity informatics, taxonomy and phylogenetics, as well as software developers from various life-sciences programming toolkits and phylogenetic software projects, for an intense four-day collaborative software coding event. The event was designed as a hands-on workshop, based on the Open Space Technology methodology, in which participants self-organise into subgroups to collaboratively plan and work on their shared research interests. We describe how the workshop was organised, the projects developed and the outcomes resulting from the workshop, as well as the challenges in bringing together a diverse group of participants to engage productively in a collaborative environment.
Implementation Experience Report for the Developing Latimer Core Standard: The DiSSCo Flanders use-case
Lissa Breugelmans, Maarten Trekels
doi: 10.3897/biss.7.113766 (no abstract published)
The Future of Natural History Transcription: Navigating AI advancements with VoucherVision and the Specimen Label Transcription Project (SLTP)
William Weaver, Kyle Lough, Stephen Smith, Brad Ruhfel
doi: 10.3897/biss.7.113067
Natural history collections are critical reservoirs of biodiversity information, but collections staff are constantly grappling with substantial backlogs and limited resources. The task of transcribing specimen label text into searchable databases requires a significant amount of time, manual labor, and funding. To address this challenge, we introduce VoucherVision, a tool harnessing the capabilities of several Large Language Models (LLMs; Naveed et al. 2023) to augment specimen label transcription. The VoucherVision tool automates laborious components of the transcription process, leveraging an Optical Character Recognition (OCR) system and LLMs to convert unstructured label text into appropriate data formats compatible with database ingestion. VoucherVision uses a combination of structured output parsers and recursive re-prompting strategies to ensure the consistency and quality of the LLM-formatted text, significantly reducing errors. Integration of VoucherVision with the University of Michigan Herbarium’s transcription workflow resulted in a substantial reduction in per-image transcription time, suggesting clear potential advantages for collections workflows. VoucherVision offers promising strides towards efficient digitization, with curatorial staff playing critical roles in data quality assurance and process oversight. Emphasizing the importance of knowledge sharing, the University of Michigan Herbarium is backing the Specimen Label Transcription Project (SLTP), which will provide open access to benchmarking datasets, fine-tuned models, and validation tools to rank the performance of different methodologies, LLMs, and prompting strategies. In the rapidly evolving landscape of Artificial Intelligence (AI) development, we recognize the profound potential of diverse contributions and innovative methodologies to redefine and advance the transformation of curatorial practices, catalyzing an era of accelerated digitization in natural history collections. An early, public version of VoucherVision is available to try here: https://vouchervision.azurewebsites.net/
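To make the parse-and-re-prompt pattern described above concrete, here is a minimal Python sketch of one way such a loop can be built. The field list, prompt wording, and the `call_llm` hook are our own illustrative placeholders, not VoucherVision's actual schema or code.

```python
import json

# Illustrative label schema; VoucherVision's real schema differs.
REQUIRED_FIELDS = ["catalogNumber", "scientificName", "collector", "eventDate", "locality"]

def parse_or_none(raw):
    """Structured output parsing: accept the reply only if it is a JSON
    object whose keys exactly match the required schema."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or set(record) != set(REQUIRED_FIELDS):
        return None
    return record

def transcribe_label(ocr_text, call_llm, max_attempts=3):
    """Recursive re-prompting: on a failed parse, feed the invalid reply
    back to the model (any prompt -> reply function) and try again."""
    prompt = (
        "Format this specimen label text as a JSON object with exactly "
        f"these keys: {REQUIRED_FIELDS}. Use null for unreadable values.\n\n"
        f"Label text:\n{ocr_text}"
    )
    for _ in range(max_attempts):
        reply = call_llm(prompt)
        record = parse_or_none(reply)
        if record is not None:
            return record
        prompt += f"\n\nYour previous reply was not valid:\n{reply}\nReturn only the JSON object."
    raise ValueError("no valid record after re-prompting")
```

The validation step is what makes the loop terminate with database-ready output: anything that fails the schema check is recycled into the next prompt rather than passed downstream.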
No Pain No Gain: Standards mapping in Latimer Core development
Matt Woodburn, Jutta Buschbom, Sharon Grant, Janeen Jones, Ben Norton, Maarten Trekels, Sarah Vincent, Kate Webbink
doi: 10.3897/biss.7.113053
Latimer Core (LtC) is a new proposed Biodiversity Information Standards (TDWG) data standard that supports the representation and discovery of natural science collections by structuring data about the groups of objects that those collections and their subcomponents encompass (Woodburn et al. 2022). It is designed to be applicable to a range of use cases that include high level collection registries, rich textual narratives and semantic networks of collections, as well as more granular, quantitative breakdowns of collections to aid collection discovery and digitisation planning. As a standard that is (in this first version) focused on natural science collections, LtC has significant intersections with existing data standards and models (Fig. 1) that represent individual natural science objects and occurrences and their associated data (e.g., Darwin Core (DwC), Access to Biological Collection Data (ABCD), Conceptual Reference Model of the International Committee on Documentation (CIDOC-CRM)). LtC’s scope also overlaps with standards for more generic concepts like metadata, organisations, people and activities (i.e., Dublin Core, World Wide Web Consortium (W3C) ORG Ontology and PROV Ontology, Schema.org). LtC represents just an element of this extended network of data standards for the natural sciences and related concepts. Mapping between LtC and intersecting standards is therefore crucial for avoiding duplication of effort in the standard development process, and ensuring that data stored using the different standards are as interoperable as possible in alignment with FAIR (Findable, Accessible, Interoperable, Reusable) principles. In particular, it is vital to make robust associations between records representing groups of objects in LtC and records (where available) that represent the objects within those groups. During LtC development, efforts were made to identify and align with relevant standards and vocabularies, and adopt existing terms from them where possible. During expert review, a more structured approach was proposed and implemented using the Simple Knowledge Organization System (SKOS) mappingRelation vocabulary. This exercise helped to better describe the nature of the mappings between new LtC terms and related terms in other standards, and to validate decisions around the borrowing of existing terms for LtC. A further exercise also used elements of the Simple Standard for Sharing Ontological Mappings (SSSOM) to start to develop a more comprehensive set of metadata around these mappings. At present, these mappings (Suppl. material 1 and Suppl. material 2) are provisional and not considered to be comprehensive, but should be further refined and expanded over time. Even with the support provided by the SKOS and SSSOM standards, the LtC experience has proven the mapping process to be far from straightforward. Different standards vary in how they are structured: for example, DwC is a ‘bag of terms’, with informal classes and no structural constraints, while more structured standards and ontologies, such as ABCD and PROV, take different approaches to defining and documenting their structure. The various standards use different metadata schemas and serialisations (e.g., Resource Description Framework (RDF), XML) for their documentation, and different methods of providing persistent, resolvable identifiers for their terms. Assessing the alignment between the concepts represented by source and target terms also involves many subtle nuances, particularly when judging whether a match is precise enough to allow an existing term to be adopted. These factors make the mapping process largely manual and labour-intensive. Approaches and tools, such as decision trees representing the logic involved (Fig. 2) and further exploration of the SSSOM standard, could help to streamline the process. In this presentation, we discuss the LtC experience of the standards mapping process, the challenges faced and methods used, and the potential to contribute this experience to collaborative standards mapping within the anticipated TDWG standards mapping interest group.
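A minimal sketch of what such mapping metadata can look like in practice: the Python snippet below serialises two SKOS-predicate mappings as an SSSOM-style TSV. The term pairings, labels, and comments are invented examples for illustration, not the published LtC mappings from Suppl. material 1 and 2.

```python
import csv

# Core SSSOM columns; the pairings below are illustrative placeholders.
mappings = [
    {
        "subject_id": "ltc:ObjectGroup",          # hypothetical LtC term CURIE
        "subject_label": "Object Group",
        "predicate_id": "skos:relatedMatch",
        "object_id": "dwc:Occurrence",
        "object_label": "Occurrence",
        "mapping_justification": "semapv:ManualMappingCuration",
        "comment": "Group-level vs. object-level concepts; related, not equivalent.",
    },
    {
        "subject_id": "ltc:Person",               # hypothetical LtC term CURIE
        "subject_label": "Person",
        "predicate_id": "skos:closeMatch",
        "object_id": "schema:Person",
        "object_label": "Person",
        "mapping_justification": "semapv:ManualMappingCuration",
        "comment": "Close enough to interoperate; term adoption judged separately.",
    },
]

with open("ltc-mappings.sssom.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(mappings[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(mappings)
```

Recording a justification and comment on every row is what lets provisional mappings be revisited later without re-deriving the reasoning behind each match.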
Structuring Information from Plant Morphological Descriptions using Open Information Extraction
Maria Mora-Cross, William Ulate, Brandon Retana Chacón, María Biarreta Portillo, Josué David Castro Ramírez, Jose Chavarria Madriz
doi: 10.3897/biss.7.113055
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for research and sustainable management. The number of publications generated is quite large: the corpus of biodiversity literature includes tens of millions of figures and taxonomic treatments. Unfortunately, most of the taxonomic descriptions are available only as unstructured text in scientific publications. With more than 61 million digitized pages in the Biodiversity Heritage Library (BHL), only 467,265 taxonomic treatments are available in the Biodiversity Literature Repository. Obtaining highly structured text from digitized text has been shown to be complex and very expensive (Cui et al. 2021). The scientific community has described over 1.2 million species, but studies suggest that 86% of existing species on Earth and 91% of species in the ocean still await description (Mora et al. 2011). The published descriptions synthesize observations made by taxonomists over centuries of research and include detailed morphological aspects (i.e., shape and structure) of species useful to identify specimens, to improve information search mechanisms, to perform data analysis of species having particular characteristics, and to compare species descriptions. To take full advantage of this information and to work towards integrating it with repositories of biodiversity knowledge, the biodiversity informatics community first needs to convert plain text into a machine-processable format. More precisely, there is a need to identify structure and substructure names and the characters that describe them (Fig. 1). Open information extraction (OIE) is a research area of Natural Language Processing (NLP), which aims to automatically extract structured, machine-readable representations of data available in unstructured text; usually the result is handled as n-ary propositions, for instance, triples of the form <noun phrase, relation phrase, noun phrase> (Shen et al. 2022). OIE is continuously evolving with advancements in NLP and machine learning techniques. The state of the art in OIE involves the use of neural approaches, pre-trained language models, and integration of dependency parsing and semantic role labeling. Neural solutions mainly formulate OIE as a sequence tagging problem or a sequence generation problem. Ongoing research focuses on improving extraction accuracy; handling complex linguistic phenomena, for instance, addressing challenges like coreference resolution; and extending open information extraction to more languages, because most existing neural solutions work on English texts (Zhou et al. 2022). The main objective of this project is to evaluate and compare the results of automatic data extraction from plant morphological descriptions using pre-trained language models (PLMs) and a language model trained on data from plant morphological descriptions written in Spanish. The research data for this study were sourced from the species records database of the National Biodiversity Institute of Costa Rica (INBio). Specifically, the project focuses on records of plant species morphological descriptions written in Spanish. The system processes the morphological descriptions using a workflow comprising stages for data selection and preprocessing, feature extraction, testing PLMs, local language model training, and testing and evaluation of results; Fig. 2 shows the general workflow used in this study. Preprocessing and annotation: descriptions were standardized by removing special characters (such as double and single quotes), replacing abbreviations, tokenizing the text, and applying other transformations; some records of the dataset were annotated with ground-truth structured information in the form of triples extracted from each paragraph, and structured data from the project carried out by Mora and Araya (Mora and Araya 2018) were also included. Feature extraction: token vectorization used word embeddings taken directly from the language models. Testing PLMs: the PLMs were evaluated with a zero-shot approach, applying each model to the test dataset, extracting information, and comparing it against the annotated ground truth. Local language model training: the annotated data were split into 80% training data and 20% test data, and the training data were used to train a language model based on the Transformer architecture. Evaluating results: evaluation metrics such as precision, recall, and F1 (a measure of model accuracy) were computed by comparing the extracted information with the ground truth; the results were analyzed to understand model performance, identify strengths and weaknesses, and gain insight into the models' ability to extract accurate and relevant information, and the models were iteratively improved on the basis of this analysis.
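As a sketch of the evaluation stage described above, the snippet below scores extracted triples against annotated ground truth by exact matching; the example triples are invented for illustration.

```python
def triple_scores(predicted, gold):
    """Precision, recall, and F1 for one description, by exact match of
    (structure, character, value) triples."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)                      # correctly extracted triples
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented example for a description like "hojas alternas, de 5-10 cm":
gold = [("leaf", "arrangement", "alternate"), ("leaf", "length", "5-10 cm")]
pred = [("leaf", "arrangement", "alternate"), ("leaf", "length", "5 cm")]
print(triple_scores(pred, gold))  # -> (0.5, 0.5, 0.5)
```

Exact matching is a strict lower bound; normalised or fuzzy matching of value strings is a common, more forgiving variant.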
Comparative Study: Evaluating the effects of class balancing on transformer performance in the PlantNet-300k image dataset
José Chavarría Madriz, Maria Mora-Cross, William Ulate
doi: 10.3897/biss.7.113057
Image-based identification of plant specimens plays a crucial role in various fields such as agriculture, ecology, and biodiversity conservation. The growing interest in deep learning has led to remarkable advancements in image classification techniques, particularly with the utilization of convolutional neural networks (CNNs). Since 2015, in the context of the PlantCLEF (Conference and Labs of the Evaluation Forum) challenge (Joly et al. 2015), deep learning models, specifically CNNs, have consistently achieved the most impressive results in this field (Carranza-Rojas 2018). However, recent developments have introduced transformer-based models, such as ViT (Vision Transformer) (Dosovitskiy et al. 2020) and CvT (Convolutional vision Transformer) (Wu et al. 2021), as a promising alternative for image classification tasks. Transformers offer unique advantages such as capturing global context and handling long-range dependencies (Vaswani et al. 2017), which make them suitable for complex recognition tasks like plant identification. In this study, we focus on the image classification task using the PlantNet-300k dataset (Garcin et al. 2021a). The dataset consists of a large collection of 306,146 plant images representing 1,081 distinct species. These images were selected from the Pl@ntNet citizen observatory database. The dataset has two prominent characteristics that pose challenges for classification. First, there is a significant class imbalance, meaning that a small subset of species dominates the majority of the images. This imbalance creates bias and affects the accuracy of classification models. Second, many species exhibit visual similarities, making it tough, even for experts, to accurately identify them. These characteristics are referred to by the dataset authors as long-tailed distribution and high intrinsic ambiguity, respectively (Garcin et al. 2021b). In order to address the inherent challenges of the PlantNet-300k dataset, we employed a two-fold approach. Firstly, we leveraged transformer-based models to tackle the dataset's intrinsic ambiguity and effectively capture the complex visual patterns present in plant images. Secondly, we focused on mitigating the class imbalance issue through various data preprocessing techniques, specifically class balancing methods. By implementing these techniques, we aimed to ensure fair representation of all plant species in order to improve the overall performance of image classification models. Our objective is to assess the effects of data preprocessing techniques, specifically class balancing, on the classification performance of the PlantNet-300k dataset. By exploring different preprocessing methods, we addressed the class imbalance issue and, through precise evaluation, conducted a comparison of the performance of transformer-based models with and without class balancing techniques. Through these efforts, our ultimate goal is to determine whether these techniques allow us to achieve more accurate and reliable classification results, especially for underrepresented species in the dataset. In our experiments, we compared the performance of two transformer-based models (ViT and CvT) on two versions of the PlantNet-300k dataset, one with class balancing and one without, yielding four sets of metrics for evaluation in total. To assess classification performance, we used a broad range of commonly used metrics, including recall, precision, accuracy, and the area under the curve (AUC) of the ROC (receiver operating characteristic). These metrics provide insight into each model's ability to correctly classify plant species, identify false positives and negatives, measure overall accuracy, and assess discriminative power. By conducting this comparative study, we seek to contribute to the advancement of plant identification research by providing empirical evidence of the benefits and effectiveness of class balancing techniques in improving the performance of transformer-based models on the PlantNet-300k dataset and similar datasets.
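One standard way to implement the class balancing step is inverse-frequency oversampling of the training set. The PyTorch sketch below is a generic illustration of that technique, under our own assumptions, not the exact preprocessing used in the study.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=64):
    """Draw rare classes more often so batches are roughly class-balanced.

    `labels` holds one integer class id per sample, e.g., the 1,081
    species ids of PlantNet-300k.
    """
    counts = Counter(labels)
    # Inverse-frequency weight per sample: a species with few images
    # gets a proportionally higher chance of being drawn.
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Oversampling leaves the images untouched (unlike undersampling, which discards data), at the cost of repeating rare-class images within an epoch.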
Filling Gaps in Earthworm Digital Diversity in Northern Eurasia from Russian-language Literature
Maxim Shashkov, Natalya Ivanova, Sergey Ermolov
doi: 10.3897/biss.7.112957
Data availability for certain groups of organisms (ecosystem engineers, invasive or protected species, etc.) is important for monitoring and making predictions in changing environments. One of the most promising directions for research on the impact of changes is species distribution modelling. Such technologies are highly dependent on occurrence data of high quality (Van Eupen et al. 2021). Earthworms (order Crassiclitellata) are a key group of organisms (Lavelle 2014), but their distribution around the globe is underrepresented in digital resources. Dozens of earthworm species, both widespread and endemic, inhabit the territory of Northern Eurasia (Perel 1979), but extremely poor data on them is available through global biodiversity repositories (Cameron 2018). There are two main obstacles to data mobilisation. Firstly, studies of the diversity of earthworms in Northern Eurasia have a long history (since the end of the nineteenth century) and were conducted by several generations of Soviet and Russian researchers. Most of the collected data have been published in "grey literature", now stored only in a few libraries. Until recently, most of these remained largely undigitised, and some are probably irretrievably lost. The second problem is the difference in the taxonomic checklists used by Soviet and European researchers. Not all species and synonyms are included in the GBIF (Global Biodiversity Information Facility) Backbone Taxonomy. As a result, existing earthworm species distribution models (Phillips 2019) potentially miss a significant amount of data and may underestimate biodiversity and predict distributions inaccurately. To fill this gap, we collected occurrence data from the Russian-language literature (published by Soviet and Russian researchers) and digitised species checklists, keeping the original scientific names. To find relevant literature, we conducted a keyword search for "earthworms" and "Lumbricidae" through the Russian national scientific online library eLibrary and screened reference lists from the monographs of the leading Soviet and Russian soil zoologist Tamara Perel (Vsevolodova-Perel 1997, Perel 1979). As a result, about 1,000 references were collected, of which 330 papers had titles indicating the potential to contain data on earthworm occurrences. Among these, 219 were found as PDF files or printed papers. For dataset compilation, 159 papers were used; the others had no exact location data or duplicated data contained in other papers. Most of the sources were peer-reviewed articles (Table 1). A reference list is available through Zenodo (Ivanova et al. 2023). The earliest publication we could find dates back to 1899, by Wilhelm Michaelsen; the most recent is from 2023. About a third of the sources were written by the systematists Iosif Malevich and Tamara Perel. Occurrence data were extracted and structured according to the Darwin Core standard (Wieczorek et al. 2012). During the data digitisation process, we tried to include as much of the original information as possible. Only a tenth of the publications provided geographic coordinates for the reported locations; the remaining occurrences were georeferenced manually using the point-radius method (Wieczorek et al. 2010). The dataset of earthworm occurrences from Russian-language literature was published through the Global Biodiversity Information Facility portal (Shashkov et al. 2023); it contains 5,304 occurrences of 117 species from 27 countries (Fig. 1). To improve the GBIF Backbone Taxonomy, we digitised two catalogues of earthworm species published by Tamara Perel, covering the USSR (Perel 1979) and the Russian Federation (Vsevolodova-Perel 1997). Based on these monographs, three checklist datasets were published through GBIF (Shashkov 2023b, 124 records; Shashkov 2023c, 87 records; Shashkov 2023a, 95 records). We are now working on incorporating these names into the GBIF Backbone so that all species names mentioned in papers published by Soviet and Russian researchers can be matched and recorded.
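For illustration, a single literature-derived occurrence structured with Darwin Core terms might look like the record below. The values are invented, and coordinateUncertaintyInMeters carries the radius produced by point-radius georeferencing.

```python
# Hypothetical Darwin Core record for one literature-derived occurrence.
occurrence = {
    "occurrenceID": "urn:example:earthworms-ru:0001",  # invented identifier
    "basisOfRecord": "MaterialCitation",               # occurrence cited in literature
    "scientificName": "Lumbricus terrestris Linnaeus, 1758",
    "eventDate": "1956",
    "country": "Russia",
    "verbatimLocality": "окрестности Звенигорода",     # locality as printed in the source
    "decimalLatitude": 55.73,
    "decimalLongitude": 36.85,
    "geodeticDatum": "EPSG:4326",
    "coordinateUncertaintyInMeters": 5000,             # point-radius uncertainty
    "georeferenceProtocol": "point-radius method (Wieczorek et al. 2010)",
    "associatedReferences": "Perel 1979",
}
```

Keeping the verbatim locality alongside the interpreted coordinates preserves the original information while still making the record usable for distribution modelling.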
Robot-in-the-loop: Prototyping robotic digitisation at the Natural History Museum
Ben Scott, Arianna Salili-James, Vincent Smith
doi: 10.3897/biss.7.112947
The Natural History Museum, London (NHM) is home to an impressive collection of over 80 million specimens, of which just 5.5 million have been digitised. As in all similar collections, digitisation of these specimens is very labour-intensive, requiring time-consuming manual handling. Each specimen is extracted from its curatorial unit and placed for imaging, its labels are manually manipulated, and it is then returned to storage. Thanks to the NHM’s team of digitisers, workflows are becoming more efficient as they are refined. However, many of these workflows are highly repetitive and ideally suited to automation. The museum is now exploring integrating robots into the digitisation process. The NHM has purchased a Techman TM5 900 robotic arm, equipped with integrated Artificial Intelligence (AI) software and additional features such as custom grippers and a 3D scanner. This robotic arm combines advanced imaging technologies, machine learning algorithms, and robotic manipulation capabilities to capture high-quality specimen data, making it possible to digitise vast collections efficiently (Fig. 1). We showcase the NHM's application of robotics for digitisation, outlining the use cases developed for implementation and the prototypical workflows already in place at the museum. We will explore our invasive and non-invasive digitisation experiments, the many challenges, and the initial results of our early experiments with this transformative technology.
What Can You Do With 200 Million Newspaper Articles: Exploring GLAM data in the Humanities
Tim Sherratt
doi: 10.3897/biss.7.112935
I’m a historian who works with data from the GLAM sector (galleries, libraries, archives and museums). When I talk about GLAM data, I’m usually talking about things like newspapers, government documents, photographs, letters, websites, and books. Some of it is well-described, structured, and easily accessible, and some is not. All of it offers us the chance to ask new questions of our past, to see things differently. But what tools, what examples, what documentation, and what support are needed to encourage researchers to explore these possibilities—to engage with collections as data? In this talk, I’ll be describing some of my own adventures amidst GLAM data, before focusing on questions of access, infrastructure, and skills development. In particular, I’ll be introducing the GLAM Workbench—a collection of tools, tutorials, examples, and hacks aimed at helping humanities researchers navigate the world of data. What pathways do we need, and how can we build them?
Using ChatGPT with Confidence for Biodiversity-Related Information Tasks
Michael Elliott, José Fortes
doi: 10.3897/biss.7.112926
Recent advancements in conversational Artificial Intelligence (AI), such as OpenAI's Chat Generative Pre-Trained Transformer (ChatGPT), present the possibility of using large language models (LLMs) as tools for retrieving, analyzing, and transforming scientific information. We have found that ChatGPT (GPT 3.5) can provide accurate biodiversity knowledge in response to questions about species descriptions, occurrences, and taxonomy, as well as structure information according to data sharing standards such as Darwin Core. A rigorous evaluation of ChatGPT's capabilities in biodiversity-related tasks may help to inform viable use cases for today's LLMs in research and information workflows. In this work, we test the extent of ChatGPT's biodiversity knowledge, characterize its mistakes, and suggest how LLM-based systems might be designed to complete knowledge-based tasks with confidence. To test ChatGPT's biodiversity knowledge, we compiled a question-and-answer test set derived from Darwin Core records available in Integrated Digitized Biocollections (iDigBio). Each question focuses on one or more Darwin Core terms to test the model’s ability to recall species occurrence information and its understanding of the standard. The test set covers a range of locations, taxonomic groups, and both common and rare species (defined by the number of records in iDigBio). The results of the tests will be presented. We also tested ChatGPT on generative tasks, such as creating species occurrence maps. A visual comparison of the maps with iDigBio data shows that for some species, ChatGPT can generate fairly accurate representations of their geographic ranges (Fig. 1). ChatGPT's incorrect responses in our tests show several patterns of mistakes. First, responses can be self-conflicting. For example, when asked "Does Acer saccharum naturally occur in Benton, Oregon?", ChatGPT responded "YES, Acer saccharum DOES NOT naturally occur in Benton, Oregon". ChatGPT can also be misled by semantics in species names. For Rafinesquia neomexicana, the word "neomexicana" leads ChatGPT to believe that the species primarily occurs in New Mexico, USA. ChatGPT may also confuse species, such as when attempting to describe a lesser-known species (e.g., a rare bee) within the same genus as a better-known species. Other causes of mistakes include hallucination (Ji et al. 2023), memorization (Chang and Bergen 2023), and user deception (Li et al. 2023). Some mistakes may be avoided by prompt engineering, e.g., few-shot prompting (Chang and Bergen 2023) and chain-of-thought prompting (Wei et al. 2022). These techniques assist LLMs by clarifying expectations or by guiding recollection. However, such methods cannot help when LLMs lack required knowledge. In these cases, alternative approaches are needed. A desired reliability can be theoretically guaranteed if responses that contain mistakes are discarded or corrected. This requires either detecting or predicting mistakes. Mistakes can sometimes be ruled out by verifying responses against trusted sources. For example, a trusted specimen record may be found that corroborates a response. The difficulty, however, lies in finding such records programmatically; for instance, using the search Application Programming Interfaces (APIs) of iDigBio and the Global Biodiversity Information Facility (GBIF) requires specifying indexed terms that may not appear in an LLM's response. This poses a second problem for which LLMs may be well suited. Note that with presence-only data, it may be difficult to refute presence claims or to confirm absence claims. Besides verification, probabilistic methods can be used to predict mistakes. Formulating a mistake probability usually relies on heuristics; for example, variability in the model's responses to repeated queries may be a sign of hallucination (Manakul et al. 2023). In practice, both probabilistic and verification methods may be needed to reach a desired reliability. LLM outputs that can be verified may be accepted (or discarded) directly, while other outputs are judged by estimating the probability of a mistake. We will consider a set of heuristics and verification methods and report an empirical evaluation of their impact on the reliability of ChatGPT.
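A minimal sketch of the verification idea: the snippet below queries iDigBio's public search API for specimen records that corroborate a claimed occurrence. The wrapper function and the choice of query fields are our own assumptions, and zero hits cannot prove absence with presence-only data.

```python
import json

import requests

IDIGBIO_SEARCH = "https://search.idigbio.org/v2/search/records"

def corroborating_records(scientificname, stateprovince, county):
    """Count iDigBio records matching an LLM's 'species occurs here' claim.

    A nonzero count corroborates the claim; zero is inconclusive,
    since occurrence data are presence-only.
    """
    rq = {
        "scientificname": scientificname,
        "stateprovince": stateprovince,
        "county": county,
    }
    resp = requests.get(IDIGBIO_SEARCH, params={"rq": json.dumps(rq), "limit": 1})
    resp.raise_for_status()
    return resp.json()["itemCount"]

# The example claim from the abstract:
print(corroborating_records("Acer saccharum", "Oregon", "Benton"))
```

Responses whose claims find corroborating records can be accepted directly; the remainder would fall through to probabilistic heuristics, such as scoring the variability of repeated queries.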