The number of openly accessible digital plant specimen images is growing tremendously, and these images are available through data aggregators: the Global Biodiversity Information Facility (GBIF) contains 43.2 million images and Integrated Digitized Biocollections (iDigBio) contains 32.4 million images (accessed 29 June 2023). These images carry rich ecological (morphological, phenological, taxonomic, etc.) information, which has the potential to facilitate large-scale analyses. However, extracting this information from the images and making it available to analysis tools remains challenging and requires more advanced computer vision algorithms. With the latest advancements in natural language processing, it is becoming possible to analyse images with text prompts. For example, the Contrastive Language-Image Pre-Training (CLIP) model, trained on 400 million image-text pairs, can classify everyday images: given an image and a set of candidate text prompts, the model predicts the prompt that best matches the image. We explored the feasibility of using the CLIP model to analyse digital plant specimen images. A particular focus of this study was the generation of appropriate text prompts, which is important because the prompt has a large influence on the model's results. We experimented with three different methods: a) automatic text prompts based on metadata of the specific image or other datasets, b) automatic generic text prompts describing what is in the image, and c) manual text prompts created by annotating the image. We investigated the suitability of these prompts in an experiment that tested whether the CLIP model could recognize a herbarium specimen image, using digital plant specimen images and semantically disparate text prompts. Our ultimate goal is to filter digital plant specimen images based on the availability of intact leaves and a measurement scale, to reduce the number of specimens that reach downstream pipeline steps, for instance the segmentation task in the leaf trait extraction process. To achieve this goal, we are fine-tuning the CLIP model with a dataset of around 20,000 digital plant specimen image-text prompt pairs, where the text prompts were generated using different datasets, metadata and generic text prompt methods. Since the text prompts can be created automatically, the laborious manual annotation process can be eliminated. In conclusion, we present our experimental testing of the CLIP model on digital plant specimen images under varied settings and show how the CLIP model can act as a potential filtering tool. In the future, we plan to investigate the possibility of using text prompts for instance segmentation to extract leaf trait information using Large Language Models (LLMs).
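To make the prompt-scoring idea concrete, the following minimal sketch (our illustration, not the study's pipeline) shows how CLIP ranks candidate text prompts for a single specimen image; it assumes the Hugging Face transformers package and the public openai/clip-vit-base-patch32 checkpoint, and the image path and prompt wording are illustrative assumptions.

```python
# Zero-shot prompt scoring with CLIP: the model embeds the image and each
# candidate prompt, and the softmax over image-text similarities gives one
# probability per prompt.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("specimen.jpg")  # hypothetical digitised specimen image
prompts = [
    "a herbarium specimen with intact leaves and a measurement scale",
    "a herbarium specimen without a measurement scale",
    "a photograph that is not a herbarium specimen",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.2f}  {prompt}")
```

Under a filtering setup like the one described above, images whose best-scoring prompt mentions intact leaves and a scale bar would be passed on to the segmentation step.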
{"title":"The Role of the CLIP Model in Analysing Herbarium Specimen Images","authors":"Vamsi Krishna Kommineni, Jens Kattge, Jitendra Gaikwad, Susanne Tautenhahn, Birgitta Koenig-ries","doi":"10.3897/biss.7.112566","DOIUrl":"https://doi.org/10.3897/biss.7.112566","url":null,"abstract":"The number of openly-accessible digital plant specimen images is growing tremendously and available through data aggregators: Global Biodiversity Information Facility (GBIF) contains 43.2 million images, and Intergrated Digitized Biocollections (iDigBio) contains 32.4 million images (Accessed on 29.06.2023). All these images contain great ecological (morphological, phenological, taxonomic etc.) information, which has the potential to facilitate the conduct of large-scale analyses. However, extracting this information from these images and making it available to analysis tools remains challenging and requires more advanced computer vision algorithms. With the latest advancements in the natural language processing field, it is becoming possible to analyse images with text prompts. For example, with the Contrastive Language-Image Pre-Training (CLIP) model, which was trained on 400 million image-text pairs, it is feasible to classify day-to-day life images by providing different text prompts and an image as an input to the model, then the model can predict the most suitable text prompt for the input image. We explored the feasibility of using the CLIP model to analyse digital plant specimen images. A particular focus of this study was on the generation of appropriate text prompts. This is important as the prompt has a large influence on the results of the model. We experimented with three different methods: a) automatic text prompt based on metadata of the specific image or other datasets, b) automatic generic text prompt of the image (describing what is in the image) and c) manual text prompt by annotating the image. We investigated the suitability of these prompts with an experiment, where we tested whether the CLIP model could recognize a herbarium specimen image using digital plant specimen images and semantically disparate text prompts. Our ultimate goal is to filter the digital plant specimen images based on the availability of intact leaves and measurement scale to reduce the number of specimens that reach the downstream pipeline, for instance, the segmentation task for the leaf trait extraction process. To achieve the goal, we are fine-tuning the CLIP model with a dataset of around 20,000 digital plant specimen image-text prompt pairs, where the text prompts were generated using different datasets, metadata and generic text prompt methods. Since the text prompts can be created automatically, it is possible to eradicate the laborious manual annotating process. In conclusion, we present our experimental testing of the CLIP model on digital plant specimen images with varied settings and how the CLIP model can act as a potential filtering tool. 
In future, we plan to investigate the possibility of using text prompts to do the instance segmentation to extract leaf trait information using Large Language Models (LLMs).","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135879248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heimo Rainer, Andreas Berger, Tanja Schuster, Johannes Walter, Dieter Reich, Kurt Zernig, Jiří Danihelka, Hana Galušková, Patrik Mráz, Natalia Tkach, Jörn Hentschel, Jochen Müller, Sarah Wagner, Walter Berendsohn, Robert Lücking, Robert Vogt, Lia Pignotti, Francesco Roma-Marzio, Lorenzo Peruzzi
Nomenclatural and taxonomic information is crucial for curating botanical collections. As methods for systematic and taxonomic studies have changed, classification systems have changed considerably over time (Dalla Torre and Harms 1900, Durand and Bentham 1888, Endlicher 1836, Angiosperm Phylogeny Group et al. 2016). Various approaches to storing preserved material have been implemented, most of them based on scientific names (e.g., families, genera, species), often in combination with other criteria such as geographic provenance or collectors. The collection management system JACQ was established in the early 2000s and subsequently developed to support multiple institutions. It features centralised data storage (with mirror sites) and access via the Internet. Participating collections can download their data at any time in comma-separated values (CSV) format. From the beginning, JACQ was conceived as a collaboration platform for objects housed in botanical collections, i.e., plant, fungal and algal groups. For these groups, various sources of taxonomic reference exist; nowadays online resources are preferred, e.g., Catalogue of Life, AlgaeBase, Index Fungorum, MycoBank, Tropicos, Plants of the World Online, International Plant Names Index (IPNI), World Flora Online, Euro+Med, Anthos, Flora of North America, REFLORA, Flora of China, Flora of Cuba, Australian Virtual Herbarium (AVH).

Implementation and (re)use of PIDs
Persistent identifiers (PIDs) for names (at any taxonomic rank), as distinct from PIDs for taxa, are essential to allow and support reliable referencing across institutions and thematic research networks (Agosti et al. 2022). For this purpose we have integrated references to several of the above-mentioned resources and populate the names used inside JACQ with those external PIDs (see the sketch below). For example, Salix rosmarinifolia is accepted in Plants of the World Online, while Euro+Med PlantBase considers it a synonym of Salix repens subsp. rosmarinifolia. Either one can be an identification of a specimen in the JACQ database.

Retrieval of collection material
One strong use case is the curation of material in historic collections. Because outdated taxon concepts were applied to the material in the past, "old" synonyms are omnipresent in historical collections. In order to retrieve all material of a given taxon, it is necessary to know all relevant names.

Future outlook
In combination with the capabilities of Linked Data and IIIF (International Image Interoperability Framework) technology, these PIDs serve as crucial elements for the integration of decentralized information systems and the reuse of (global) taxonomic backbones in combination with collection management systems (Gamer and Kreyenbühl 2022, Hyam 2022, Loh 2017).
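To make the PID enrichment step concrete, here is a minimal sketch (our illustration, not JACQ code) that resolves a list of scientific names against one such external resource, the GBIF backbone taxonomy, via its public species-match API; the CSV file name and column are assumptions.

```python
# Resolve scientific names to GBIF backbone taxon keys, one kind of external
# PID. The input CSV layout stands in for a hypothetical collection export.
import csv
import requests

with open("names.csv", newline="", encoding="utf-8") as f:
    names = [row["scientificName"] for row in csv.DictReader(f)]

for name in names:
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
        timeout=30,
    )
    match = resp.json()
    # usageKey is the backbone taxon key; status distinguishes accepted names
    # from synonyms (a synonym additionally carries an acceptedUsageKey,
    # mirroring the Salix example above).
    print(name, "->", match.get("usageKey"), match.get("status"))
```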
{"title":"Community Curation of Nomenclatural and Taxonomic Information in the Context of the Collection Management System JACQ","authors":"Heimo Rainer, Andreas Berger, Tanja Schuster, Johannes Walter, Dieter Reich, Kurt Zernig, Jiří Danihelka, Hana Galušková, Patrik Mráz, Natalia Tkach, Jörn Hentschel, Jochen Müller, Sarah Wagner, Walter Berendsohn, Robert Lücking, Robert Vogt, Lia Pignotti, Francesco Roma-Marzio, Lorenzo Peruzzi","doi":"10.3897/biss.7.112571","DOIUrl":"https://doi.org/10.3897/biss.7.112571","url":null,"abstract":"Nomenclatural and taxonomic information are crucial for curating botanical collections. In the course of changing methods for systematic and taxonomic studies, classification systems changed considerably over time (Dalla Torre and Harms 1900, Durand and Bentham 1888, Endlicher 1836, Angiosperm Phylogeny Group et al. 2016). Various approaches to store preserved material have been implemented, most of them based on scientific names (e.g., families, genera, species) often in combination with other criteria such as geographic provenance or collectors. The collection management system, JACQ, was established in the early 2000s then developed to support multiple institutions. It features a centralised data storage (with mirror sites) and access via the Internet. Participating collections can download their data at any time in a comma-separated values (CSV) format. From the beginning, JACQ was conceived as a collaboration platform for objects housed in botanical collections, i.e., plant, fungal and algal groups. For these groups, various sources of taxonomic reference exist, nowadays online resources are preferred, e.g., Catalogue of Life, AlgaeBase, Index Fungorum, Mycobank, Tropicos, Plants of the World Online, International Plant Names Index (IPNI), World Flora Online, Euro+Med, Anthos, Flora of Northamerica, REFLORA, Flora of China, Flora of Cuba, Australian Virtual Herbarium (AVH). Implementation and (re)use of PIDs Persistent identifiers (PIDs) for names (at any taxonomic rank) apart from PIDs for taxa, are essential to allow and support reliable referencing across institutions and thematic research networks (Agosti et al. 2022). For this purpose we have integrated referencing to several of the above mentioned resources and populate the names used inside JACQ with those external PIDs. For example, Salix rosmarinifolia is accepted in Plants of the World Online while Euro+Med Plantbase considers it a synonym of Salix repens subsp. rosmarinifolia. Either one can be an identification of a specimen in the JACQ database. Retrieval of collection material One strong use case is the curation of material in historic collections. On the basis of outdated taxon concepts that were applied to the material in history, \"old\" synonyms are omnipresent in historical collections. In order to retrieve all material of a given taxon, it is necessary to know all relevant names. 
Future outlook In combination with the capability of Linked Data and the IIIF (International Image Interoperability Framework) technology, these PIDs serve as crucial elements for the integration of decentralized information systems and reuse of (global) taxonomic backbones in combination with collection management systems (Gamer and Kreyenbühl 2022, Hyam 2022, Loh 2017).","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135879283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro
Understanding the biology underpinning the natural regeneration of plant species in order to plan effective reforestation is a complex task. It can be aided by access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exist widely used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species' reproductive condition and habitat derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive conditions and temporal expressions (Gabud and Batista-Navarro 2018). We then built a new unsupervised hybrid approach for relation extraction (RE) that combines classical rule-based pattern-matching methods with transformer-based language models, framing the RE task as a natural language inference (NLI) task. Using this hybrid approach, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprising a named entity recognition (NER) tool and our hybrid RE tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned on COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold-standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as "fruited heavily" and "mass flowering"). We then use our hybrid RE tool to extract reproductive condition–temporal expression and habitat–geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agriculture and Bioscience International) and show that our work enables the enrichment of descriptive information on the reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.
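As an illustration of how RE can be framed as NLI, the following minimal sketch (our construction, not the authors' tool) scores candidate relation hypotheses against a source sentence with a public zero-shot NLI model; the sentence, hypotheses and checkpoint choice are all assumptions.

```python
# Framing relation extraction as natural language inference: each hypothesis
# pairs a reproductive-condition mention with a candidate temporal expression
# from the NER step, and the entailment score ranks the candidate pairs.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = ("Mass flowering was observed in December 1987 "
            "on dry ridges above the river.")

hypotheses = [
    "The flowering happened in December 1987.",
    "The flowering happened on dry ridges.",
]

result = nli(sentence, candidate_labels=hypotheses)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {label}")
```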
{"title":"Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline","authors":"Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro","doi":"10.3897/biss.7.112505","DOIUrl":"https://doi.org/10.3897/biss.7.112505","url":null,"abstract":"Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exists widely-used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive condition and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), which is a combination of classical rule-based pattern-matching methods and transformer-based language models that framed our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprised of a named entity recognition (NER) tool and our hybrid relation extraction (RE) tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as \"fruited heavily\" and \"mass flowering\"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. 
This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bahadir Altintas, Yasin Bakış, Xiojun Wang, Henry Bart
Artificial Intelligence (AI) is becoming more prevalent in data science as well as in areas of computational science. Commonly used classification methods in AI can also be applied to unorganized databases if a proper model is trained. Most classification work is done on image data for purposes such as object detection and face recognition. If an object is well detected in an image, classification can be used to organize the image data. In this work, we try to identify images from an Integrated Digitized Biocollections (iDigBio) dataset and to classify these images to generate metadata for use as an AI-ready dataset in the future. The main problems with museum image datasets are missing metadata for images, wrong categorization, and poor image quality. By using AI, it may be possible to overcome these problems: automatic tools can help find, eliminate or fix them. For our example, we trained a model for 10 classes (e.g., complete fish, photograph, notes/labels, X-ray, CT (computerized tomography) scan, partial fish, fossil, skeleton) using a manually tagged iDigBio image dataset. After training a model for each class, we reclassified the dataset using these trained models. Some of the results are given in Table 1. As can be seen in the table, even manually classified images can be identified as different classes, and some classes are visually very similar to each other, such as CT scans and X-rays, or fossils and skeletons. Such similarities are confusing for the human eye as well as for AI.
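For readers who want a starting point, the sketch below shows one plausible setup under stated assumptions: fine-tuning a single 10-way image classifier with PyTorch and torchvision. The abstract describes training a model per class; the folder layout, backbone and hyperparameters here are ours, not the authors'.

```python
# Fine-tune an ImageNet-pretrained ResNet-50 on 10 specimen-image classes.
# Assumes images are sorted into one subfolder per class, e.g.
# data/complete_fish, data/xray, data/fossil, ...
import torch
from torch import nn
from torchvision import datasets, models, transforms

tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("data", transform=tfms)       # 10 class subfolders
loader = torch.utils.data.DataLoader(ds, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(ds.classes))  # new 10-way head

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```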
{"title":"Application of AI-Helped Image Classification of Fish Images: An iDigBio dataset example","authors":"Bahadir Altintas, Yasin Bakış, Xiojun Wang, Henry Bart","doi":"10.3897/biss.7.112438","DOIUrl":"https://doi.org/10.3897/biss.7.112438","url":null,"abstract":"Artificial Intelligence (AI) becomes more prevalent in data science as well as in areas of computational science. Commonly used classification methods in AI can also be used for unorganized databases, if a proper model is trained. Most of the classification work is done on image data for purposes such as object detection and face recognition. If an object is well detected from an image, the classification may be done to organize image data. In this work, we try to identify images from an Integrated Digitized Biocollections (iDigBio) dataset and to classify these images to generate metadata to use as an AI-ready dataset in the future. The main problem of the museum image datasets is the lack of metadata information on images, wrong categorization, or poor image quality. By using AI, it maybe possible to overcome these problems. Automatic tools can help find, eliminate or fix these problems. For our example, we trained a model for 10 classes (e.g., complete fish, photograph, notes/labels, X-ray, CT (computerized tomotography) scan, partial fish, fossil, skeleton) by using a manually tagged iDigBio image dataset. After training a model for each for class, we reclassified the dataset by using these trained models. Some of the results are given in Table 1. As can be seen in the table, even manually classified images can be identified as different classes, and some classes are very similar to each other visually such as CT scans and X-rays or fossils and skeletons. Those kind of similarities are very confusing for the human eye as well as AI results.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rapid accumulation of biodiversity data and the development of deep learning methods bring opportunities for detecting and identifying wild animals automatically, based on artificial intelligence. In this paper, we introduce an AI-based wild animal detection system. It is composed of acoustic and image sensors, network infrastructure, species recognition models, and a data storage and visualization platform, following the technical chain of the Internet of Things (IoT) applied to biodiversity detection. The workflow of the system is as follows:

Deploying sensors for different detection targets. The acoustic sensor is composed of two microphones that pick up sounds from the environment and an edge computing box that judges and sends back the sound files; it is suitable for monitoring birds, mammals, chirping insects and frogs. The image sensor is composed of a high-performance camera that can be controlled to record its surroundings automatically and a video-analysis edge box running a model for detecting and recording animals; it is suitable for monitoring waterbirds in locations without visual obstructions.

Adopting different networks according to signal availability. Network infrastructure is critical for the detection system and for transferring the data collected by the sensors. We use the existing network where 4G/5G signals are available and build dedicated networks using mesh networking technology for areas without signal. Multiple network strategies lower the cost of monitoring jobs.

Recognizing species from sounds, images or videos. AI plays a key role in our system. We have trained acoustic models for more than 800 Chinese birds and some common chirping insects and frogs, which can be identified from sound files recorded by the acoustic sensors. For video and image data, we have trained models for recognizing 1,300 Chinese birds and 400 mammals, which help to discover and count animals captured by the image sensors. Moreover, we propose a special method for detecting species through features of voices, images and niche features of animals; it is a flexible framework that adapts to different combinations of acoustic and image sensors. All models were trained with labeled voices, images and distribution data from the Chinese species database ESPECIES.

Saving and displaying machine observations. The original sound, image and video files, together with the identification results, are stored in a data platform deployed on the cloud for extensible computing and storage. We have developed visualization modules in the platform that display sensors on maps using WebGIS, show curves of the number of records and species per day, raise real-time alerts when sensors capture animals, and report other parameters.
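The following minimal sketch (our illustration, not the deployed system) shows the shape of the edge-side loop from the first step: record a short audio clip, classify it on the device, and upload only confident detections to the central platform. classify_clip() and the ingest endpoint are hypothetical placeholders.

```python
# Edge-side acoustic sensing loop: two-channel recording, local inference,
# and upload of confident detections only, to save bandwidth on weak links.
import time
import requests
import sounddevice as sd
import soundfile as sf

SR = 32_000                 # sample rate in Hz
CLIP_SECONDS = 10
INGEST_URL = "https://example.org/api/detections"   # hypothetical endpoint

def classify_clip(path: str) -> dict:
    # Placeholder: a real deployment runs the on-device recognition model here.
    return {"species": "unknown", "score": 0.0}

while True:
    audio = sd.rec(int(CLIP_SECONDS * SR), samplerate=SR, channels=2)
    sd.wait()                                   # block until recording ends
    sf.write("clip.wav", audio, SR)
    detection = classify_clip("clip.wav")
    if detection.get("score", 0.0) > 0.8:       # send only confident hits
        with open("clip.wav", "rb") as f:
            requests.post(INGEST_URL, files={"audio": f},
                          data={"species": detection["species"]}, timeout=60)
    time.sleep(1)
```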
{"title":"An AI-based Wild Animal Detection System and Its Application","authors":"Congtian Lin, Jiangning Wang, Liqiang Ji","doi":"10.3897/biss.7.112456","DOIUrl":"https://doi.org/10.3897/biss.7.112456","url":null,"abstract":"Rapid accumulation of biodiversity data and development of deep learning methods bring the opportunities for detecting and identifying wild animals automatically, based on artificial intelligence. In this paper, we introduce an AI-based wild animal detection system. It is composed of acoustic and image sensors, network infrastructures, species recognition models, and data storage and visualization platform, which go through the technical chain learned from Internet of Things (IOT) and applied to biodiversity detection. The workflow of the system is as follows: Deploying sensors for different detection targets . The acoustic sensor is composed of two microphones for picking up sounds from the environment and an edge computing box for judging and sending back the sound files. The acoustic sensor is suitable for monitoring birds, mammals, chirping insects and frogs. The image sensor is composed of a high performance camera that can be controlled to record surroundings automatically and a video analysis edge box running a model for detecting and recording animals. The image sensor is suitable for monitoring waterbirds in locations without visual obstructions. Adopting different networks according to signal availability . Network infrastructures are critical for the detection system and the task of transferring data collected by sensors. We use the existing network when 4/5G signals are available, and build special networks using Mesh Networking technology for the areas without signals. Multiple network strategies lower the cost for monitoring jobs. Recognizing species from sounds, images or videos . AI plays a key role in our system. We have trained acoustic models for more than 800 Chinese birds and some common chirping insects and frogs, which can be identified from sound files recorded by acoustic sensors. For video and image data, we also have trained models for recognizing 1300 Chinese birds and 400 mammals, which help to discover and count animals captured by image sensors. Moreover, we propose a special method for detecting species through features of voices, images and niche features of animals. It is a flexible framework to adapt to different combinations of acoustic and image sensors. All models were trained with labeled voices, images and distribution data from Chinese species database, ESPECIES. Saving and displaying machine observations . The original sound, image and video files with identified results were stored in the data platform deployed on the cloud for extensible computing and storage. We have developed visualization modules in the platform for displaying sensors on maps using WebGIS to show curves of the number of records and species for each day, real time alerts from sensors capturing animals, and other parameters. Deploying sensors for different detection targets . 
The acoustic sensor is composed of two microphones for picking up sounds from the environment and an edge computing box for judging and sending back the sound fil","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135981784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Insects account for half of all described living organisms on Earth, with a vast number of species awaiting description. Insects play a major role in ecosystems but are threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens, and large-scale digitization initiatives, such as the digitization street "digitize!" at the Museum für Naturkunde, have recently been undertaken to unlock this data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases and facilitating scientific investigations, the sustainability of the collected data, and efficient knowledge transfer. Despite advancements in high-throughput imaging techniques for specimens and their labels, the process of transcribing label information remains mostly manual and lags behind the pace of digitization efforts. To address this issue, we propose a three-step semi-automated pipeline that focuses on extracting and processing information from individual insect labels. Our solution is primarily designed for printed insect labels, as OCR (optical character recognition) technology performs well on printed text while handwritten text still yields mixed results. The pipeline incorporates computer vision (CV) techniques, OCR, and a clustering algorithm. The initial stage of the pipeline involves image analysis using a convolutional neural network (CNN) model. The model was trained on 2,100 images from three distinct insect label datasets: AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). This first model identifies and isolates single labels within an image, effectively segmenting each label region from the rest of the image, and crops them into multiple new, single-label image files. It also assigns the labels to different classes, i.e., printed or handwritten, so that handwritten labels are sorted out from the printed ones. In the second step, labels classified as "printed" are parsed by an OCR engine to extract the text from the labels. Tesseract and Google Vision OCR were both tested to assess their performance. While Google Vision OCR is a cloud-based service with limited configurability, Tesseract provides the flexibility to fine-tune settings and enhance its performance for our specific use cases. In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm. This step allows the identification and formation of clusters of labels that share identical or highly similar content. Ultimately, these clusters are compared against a curated database of labels and are either assigned to a known label or highlighted as new and manually added to the database. To assess the efficiency of our pipeline, we ran benchmarking experiments using a set of images similar to those the models were trained on, as well as additional image sets from various museum collections. Our pipeline offers several advantages: it streamlines the data entry process and reduces the time and effort of manual extraction, while minimising potential human errors and inconsistencies in label transcription. It promises to accelerate metadata extraction from insect specimens, facilitate scientific research, and enable large-scale analyses that yield deeper insights into these specimens.
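As a concrete illustration of steps two and three (our sketch, not the authors' code), the snippet below OCRs a folder of already-cropped, printed single-label images with Tesseract and then clusters near-identical transcriptions; the folder layout and clustering threshold are assumptions.

```python
# OCR cropped label images, then group labels with near-identical content.
# Assumes pytesseract (with a local Tesseract install), Pillow, scikit-learn.
from pathlib import Path

import pytesseract
from PIL import Image
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [pytesseract.image_to_string(Image.open(p)).strip()
         for p in sorted(Path("labels").glob("*.png"))]

# Character n-gram TF-IDF is tolerant of OCR noise; cosine distances with
# average linkage let clusters form around near-identical transcriptions.
vectors = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(texts)
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6,
    metric="cosine", linkage="average",
).fit_predict(vectors.toarray())

for cluster_id, text in zip(clusters, texts):
    print(cluster_id, text[:60].replace("\n", " "))
```

Each resulting cluster can then be matched against the curated label database, with unmatched clusters flagged for manual review, as described above.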
{"title":"High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi-Automated Pipeline","authors":"Margot Belot, Leonardo Preuss, Joël Tuberosa, Magdalena Claessen, Olha Svezhentseva, Franziska Schuster, Christian Bölling, Théo Léger","doi":"10.3897/biss.7.112466","DOIUrl":"https://doi.org/10.3897/biss.7.112466","url":null,"abstract":"Insects account for half of the total described living organisms on Earth, with a vast number of species awaiting description. Insects play a major role in ecosystems but are yet threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens and large-scale digitization initiatives, such as the digitization street digitize! at the Museum für Naturkunde, have been undertaken recently to unlock this data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases and facilitating scientific investigations, sustainability of the collected data, and efficient knowledge transfer. Despite the advancements in high-throughput imaging techniques for specimens and their labels, the process of transcribing label information remains mostly manual and lags behind the pace of digitization efforts. In order to address this issue, we propose a three step semi-automated pipeline that focuses on extracting and processing information from individual insect labels. Our solution is primarily designed for printed insect labels, as the OCR (optical character recognition) technology performs well for printed text while handwritten texts still yield mixed results. The pipeline incorporates computer vision (CV) techniques, OCR, and a clustering algorithm. The initial stage of our pipeline involves image analysis using a convolutional neural network (CNN) model. The model was trained using 2100 images from three distinct insect label datasets, namely AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). The first model enables the identification and isolation of single labels within an image, effectively segmenting the label region from the rest of the image, and crops them into multiple new, single-label image files. It also assigns the labels to different classes, i.e., printed text or handwritten, with handwritten labels sorted out from the printed ones. In the second step, labels classified as “printed” are then parsed by an OCR engine to extract the text information from the labels. Tesseract and Google Vision OCRs were both tested to assess their performance. While Google Vision OCR is a cloud-based service with limited configurability, Tesseract provides the flexibility to fine-tune settings and enhance its performance for our specific use cases. In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm. This step allows for the identification and formation of clusters that consist of labels sharing identical or highly similar content. Ultimately, these clusters are compared against a curated database of labels and are assigned to a known label or highlighted as new and manually added to the database. 
In order to assess the efficiency of our pipeline","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135982444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alan Stenhouse, Nicole Fisher, Brendan Lepschi, Alexander Schmidt-Lebuhn, Juanita Rodriguez, Federica Turco, Emma Toms, Andrew Reeson, Cécile Paris, Pete Thrall
Biological collections play a crucial role in our understanding of biodiversity and inform research in areas such as biosecurity, conservation, human health and climate change. In recent years, the digitisation of biological specimen collections has emerged as a vital mechanism for preserving and facilitating access to these invaluable scientific datasets. However, the growing volume of specimens and associated data presents significant challenges for curation and data management. By leveraging human-Artificial Intelligence (AI) collaborations, we aim to transform the way biological collections are curated and managed, unlocking their full potential in addressing global challenges. We present our initial contribution to this field through the development of a software prototype to improve metadata extraction from digital specimen images in biological collections. The prototype provides an easy-to-use platform for collaborating with web-based AI services, such as Google Vision and OpenAI's Generative Pre-trained Transformer (GPT) Large Language Models (LLMs). We demonstrate its effectiveness when applied to herbarium and insect specimen images. Machine-human collaboration may occur at various points within the workflows and can significantly affect outcomes. Initial trials suggest that the visual display of AI model uncertainty could be useful during expert data curation. While much work remains to be done, our results indicate that collaboration between humans and AI models can significantly improve the digitisation rate of biological specimens and thereby enable faster global access to this vital data. Finally, we introduce our broader vision for improving biological collection curation and management using human-AI collaborative methods. We explore the rationale behind this approach and the potential benefits of adding AI-based assistants to collection teams. We also examine future possibilities and the concept of creating 'digital colleagues' for seamless collaboration between human and digital curators. This 'collaborative intelligence' will enable us to make better use of both human and machine capabilities to achieve the goal of unlocking and improving our use of these vital biodiversity data to tackle real-world problems.
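As one example of what such a human-AI step can look like (our sketch, not the prototype), the snippet below sends raw OCR text from a specimen label to an LLM and asks for structured, Darwin Core-style fields back; the openai v1 client usage is real, but the model choice and label text are illustrative assumptions.

```python
# Turn raw OCR text from a specimen label into structured metadata fields.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ocr_text = "Eucalyptus pauciflora Cotter Rd 12 Jan 1964 leg. J. Smith"  # invented

response = client.chat.completions.create(
    model="gpt-4o-mini",                         # assumed model choice
    response_format={"type": "json_object"},     # force parseable JSON output
    messages=[{
        "role": "user",
        "content": ("Extract scientificName, locality, eventDate and "
                    "recordedBy as a JSON object from this herbarium "
                    "label text:\n" + ocr_text),
    }],
)
print(json.loads(response.choices[0].message.content))
```

In a collaborative workflow, fields returned with low model confidence would be highlighted for the human curator rather than written directly to the collection database.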
{"title":"Improving Biological Collections Data through Human-AI Collaboration","authors":"Alan Stenhouse, Nicole Fisher, Brendan Lepschi, Alexander Schmidt-Lebuhn, Juanita Rodriguez, Federica Turco, Emma Toms, Andrew Reeson, Cécile Paris, Pete Thrall","doi":"10.3897/biss.7.112488","DOIUrl":"https://doi.org/10.3897/biss.7.112488","url":null,"abstract":"Biological collections play a crucial role in our understanding of biodiversity and inform research in areas such as biosecurity, conservation, human health and climate change. In recent years, the digitisation of biological specimen collections has emerged as a vital mechanism for preserving and facilitating access to these invaluable scientific datasets. However, the growing volume of specimens and associated data presents significant challenges for curation and data management. By leveraging human-Artificial Intelligence (AI) collaborations, we aim to transform the way biological collections are curated and managed, unlocking their full potential in addressing global challenges. We present our initial contribution to this field through the development of a software prototype to improve metadata extraction from digital specimen images in biological collections. The prototype provides an easy-to-use platform for collaborating with web-based AI services, such as Google Vision and OpenAI Generative Pre-trained Transformer (GPT) Large Language Models (LLM). We demonstrate its effectiveness when applied to herbarium and insect specimen images. Machine-human collaboration may occur at various points within the workflows and can significantly affect outcomes. Initial trials suggest that the visual display of AI model uncertainty could be useful during expert data curation. While much work remains to be done, our results indicate that collaboration between humans and AI models can significantly improve the digitisation rate of biological specimens and thereby enable faster global access to this vital data. Finally, we introduce our broader vision for improving biological collection curation and management using human-AI collaborative methods. We explore the rationale behind this approach and the potential benefits of adding AI-based assistants to collection teams. We also examine future possibilities and the concept of creating 'digital colleagues' for seamless collaboration between human and digital curators. This ‘collaborative intelligence’ will enable us to make better use of both human and machine capabilities to achieve the goal of unlocking and improving our use of these vital biodiversity data to tackle real-world problems.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margret Steinthorsdottir, Veronika Johansson, Manash Shah
The Swedish Biodiversity Data Infrastructure (SBDI) is a biodiversity informatics infrastructure and the key national resource for data-driven biodiversity and ecosystems research. SBDI rests on three pillars: mobilisation of and access to biodiversity data; development and operation of tools for analysing these data; and user support. SBDI is funded by the Swedish Research Council (VR) and eleven of Sweden's major universities and government research authorities (Fig. 1). SBDI was formed in early 2021 and represents the final step in an amalgamation of national infrastructures for biodiversity and ecosystems research. SBDI includes the Swedish node of the Global Biodiversity Information Facility (GBIF), the key international infrastructure for sharing biodiversity data. SBDI's predecessor, Biodiversity Atlas Sweden (BAS), was an early adopter of the Atlas of Living Australia (ALA) platform. SBDI pioneered container-based deployment of the platform using Docker and Docker Swarm. This container-based approach helps simplify deployment of the platform, which is characterised by a microservice architecture with loosely coupled services. This enables scalability, modularity, integration of services, and new technology insertions. SBDI has customised the BioCollect module to remove region-specific constraints so that it can be more readily improved for environmental monitoring in Sweden. To further support this, there are plans to develop services for the distribution of terrestrial map layers, which will provide important habitat information for artificial intelligence and machine learning research projects. The Amplicon Sequence Variant (ASV) portal, an interface to sequence-based observations, is an example of integration and new technology insertion. The portal, developed in SBDI and seamlessly integrated with the ALA platform, provides basic functionality for searching ASVs and occurrence records using the Basic Local Alignment Search Tool (BLAST) or filters on sequencing details and taxonomy, and for submitting metabarcoding datasets (Fig. 2). Future developments for SBDI include a continued focus on eDNA and monitoring data as well as the implementation of procedures for handling sensitive data.
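To give a flavour of the sequence search that the ASV portal exposes, here is a minimal sketch of a local BLAST query using the standard blastn command line (assumed installed); the file and database names are hypothetical stand-ins, not the portal's internals.

```python
# Run blastn against a pre-built ASV reference database and print the hits.
import subprocess

result = subprocess.run(
    ["blastn",
     "-query", "query.fasta",            # FASTA with the ASV sequence(s)
     "-db", "asv_sequences",             # hypothetical reference database
     "-outfmt", "6 qseqid sseqid pident length evalue",
     "-max_target_seqs", "5"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    qseqid, sseqid, pident, length, evalue = line.split("\t")
    print(f"{qseqid} -> {sseqid}: {pident}% identity, e={evalue}")
```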
{"title":"Swedish Biodiversity Data Infrastructure (SBDI): Insights from the Swedish ALA installation","authors":"Margret Steinthorsdottir, Veronika Johansson, Manash Shah","doi":"10.3897/biss.7.112429","DOIUrl":"https://doi.org/10.3897/biss.7.112429","url":null,"abstract":"The Swedish Biodiversity Data Infrastructure (SBDI) is a biodiversity informatics infrastructure and is the key national resource for data-driven biodiversity and ecosystems research. SBDI rests on three pillars: mobilisation and access to biodiversity data; development and operation of tools for analysing these data; and user support. SBDI is funded by the Swedish Research Council (VR) and eleven of Sweden’s major universities and research government authorities (Fig. 1). mobilisation and access to biodiversity data; development and operation of tools for analysing these data; and user support. SBDI is funded by the Swedish Research Council (VR) and eleven of Sweden’s major universities and research government authorities (Fig. 1). SBDI was formed in early 2021 and represents the final step in an amalgamation of national infrastructures for biodiversity and ecosystems research. SBDI includes the Swedish node of the Global Biodiversity Information Facility (GBIF), the key international infrastructure for sharing biodiversity data. SBDI's predecessor Biodiversity Atlas Sweden (BAS) was an early adopter of the Atlas of Living Australia (ALA) platform. SBDI pioneered the container-based deployment of the platform using Docker and Docker Swarm. This container-based approach helps simplify deployment of the platform, which is characterised by a microservice architecture with loosely coupled services. This enables scalability, modularity, integration of services, and new technology insertions. SBDI has customised the BioCollect module to remove region-specific constraints so that it can be more readily improved for environmental monitoring in Sweden. To further support this, there are plans to develop services for the distribution of terrestrial map layers, which will provide important habitat information for artificial intelligence and machine learning research projects. The Amplicon Sequence Variants (ASVs) portal, an interface to sequence-based observations, is an example of integration and new technology insertion. The portal developed in SBDI and seamlessly integrated with the ALA platform provides basic functionalities for searching ASVs and occurrence records using the Basic Local Alignment Search Tool (BLAST) or filters on sequencing details and taxonomy and for submitting metabarcoding dataset Fig. 2. Future developments for SBDI include a continued focus on eDNA and monitoring data as well as the implementation of procedures for handling sensitive data.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135982102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mora Aronsson, Malin Strand, Holger Dettki, Hanna Illander, Johan Olsson
The SLU Swedish Species Information Centre (SSIC, SLU Artdatabanken) accumulates, analyses and disseminates information concerning species and habitats occurring in Sweden. The SSIC provides an open-access biodiversity reporting and analysis infrastructure, including the Swedish Species Observation System, the Swedish taxonomic backbone Dyntaxa, and tools for species information covering traits, terminology, quality assurance and species identification. The content is available to scientists, conservationists and the public. All systems, databases, APIs and web applications rely on recognized standards to ensure interoperability. The SSIC is a leading partner within the Swedish Biodiversity Data Infrastructure (SBDI). Here we present a data flow (Fig. 1) that exemplifies the strengthening of cooperation and the transfer of experience between research, community, non-governmental organizations (NGOs), citizen science and governmental agencies, and that also presents solutions to current data challenges (e.g., data fragmentation, taxonomic issues or platform relations). This data flow aims to facilitate the process of evaluating and understanding the distribution and spread of species (e.g., invasive alien species). It provides Findable, Accessible, Interoperable and Reusable (FAIR) data and links related information between different parties such as universities, NGOs, county administrative boards (CABs) and environmental protection agencies (EPAs). The digital structure is built on the national Swedish taxonomic backbone Dyntaxa, which prevents data fragmentation due to taxonomic issues and acts as a common standard for all users. The chain of information comprises systems, tools and a linked data flow for reporting observations and verification procedures, and it can work as an early-warning system for surveillance of certain species. After an observation is reported, an alert can be activated, field checks can be carried out, and, if necessary, eradication measures can be initiated. The verification tool, which traditionally focused on the quality of species identification, has been improved to also verify geographic precision; for eradication actions this is as important as species accuracy. A digital catalogue of eradication methods is in use by the CABs, but there are also recommendations on methods for 'public' use, and collaboration between Invasive Alien Species (IAS) coordinators in regional CABs is currently being developed. The CABs have a separate tool for documenting eradication measures and, if/when measures are carried out (by CABs), this information can be fed back from the CAB tool into the database at SSIC, where it is possible to search for, and visualize, this information. Taxonomic integrity over time should remain intact and tied to the taxon identifier (ID) provided by Dyntaxa. However, metadata, such as geographic position, date, verification status, mitigation results, etc., will be fully us
{"title":"Practice, Pathways and Lessons Learned from Building a Digital Data Flow with Tools: Focusing on alien invasive species, from occurrence via measures to documentation","authors":"Mora Aronsson, Malin Strand, Holger Dettki, Hanna Illander, Johan Olsson","doi":"10.3897/biss.7.112337","DOIUrl":"https://doi.org/10.3897/biss.7.112337","url":null,"abstract":"The SLU Swedish Species Information Centre (SSIC, SLU Artdatabanken) accumulates, analyses and disseminates information concerning species and habitats occurring in Sweden. The SSIC provides an open access biodiversity reporting and analysis infrastructure including the Swedish Species Observation System, the Swedish taxonomic backbone Dyntaxa, and tools for species information including traits, terminology, quality assurance and species identification.*1 The content is available to scientists, conservationists and the public. All systems, databases, APIs and web applications, rely on recognized standards to ensure interoperability. The SSIC is a leading partner within the Swedish Biodiversity Data Infrastructure (SBDI). Here we present a data flow (Fig. 1) that exemplifies the strengthening of the cooperation and transfer of experiences between research, community, non-governmental organizations (NGOs), citizen science and governmental agencies, and also presents solutions to current data challenges (e.g., data fragmentation, taxonomic issues or platform relations). This data flow aimed to facilitate the process for evaluating and understanding the distribution and spread of species (e.g., invasive alien species). It provides Findable, Accessible, Interoperable and Reusable (FAIR) data and links related information between different parties such as universities, NGOs, county administrative boards (CABs) and environmental protection agencies (EPAs). The digital structure is built on the national Swedish taxonomic backbone Dyntaxa, which prevents data fragmentation due to taxonomic issues and acts as a common standard for all users. The chain of information contains systems, tools and a linked data flow for reporting observations, verification procedures, and it can work as an early warning system for surveillance regarding certain species. After an observation is reported, an alert can be activated, field checks can be carried out, and if necessary, eradication measures can be activated. The verification tool that traditionally has been focused on the quality of species identification has been improved, providing verification of geographic precision. This is equally important for eradication actions as is species accuracy. A digital catalogue of eradication methods is in use by the CABs but there are also recommendations on methods for ‘public’ use, and collaboration between Invasive Alien Species (IAS) coordinators in regional CABs is currently being developed. The CABs have a separate tool for documentation of eradication measures and, if/when measures are carried out (by CABs), this information can be fed back from the CAB-tool into the database in SSIC where it is possible to search for, and visualize, this information. Taxonomic integrity over time should be intact and related to the taxon identifier (ID) provided by Dyntaxa. 
However, metadata, such as geographic position, date, verification status, mitigation results, etc., will be fully us","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135981960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
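The report-verify-alert-measure chain described above is essentially an event pipeline keyed on Dyntaxa taxon IDs. The following is a minimal sketch of how such a chain might be modeled; the class names, the watch-list contents, and the uncertainty threshold are hypothetical illustrations, not the SSIC's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical watch list of Dyntaxa taxon IDs flagged as invasive;
# real IDs are issued by the SSIC taxonomic backbone.
IAS_WATCH_LIST = {233833}

@dataclass
class Observation:
    dyntaxa_id: int             # taxon ID from the Dyntaxa backbone
    latitude: float
    longitude: float
    coord_uncertainty_m: float  # geographic precision of the record
    observed_on: date
    species_verified: bool = False
    geography_verified: bool = False
    measures: list = field(default_factory=list)

def verify(obs: Observation, max_uncertainty_m: float = 100.0) -> bool:
    """Verification covers geographic precision as well as species identity."""
    obs.species_verified = True  # stands in for an expert or automated check
    obs.geography_verified = obs.coord_uncertainty_m <= max_uncertainty_m
    return obs.species_verified and obs.geography_verified

def needs_alert(obs: Observation) -> bool:
    """Early-warning trigger: a verified record of a watched taxon."""
    return obs.dyntaxa_id in IAS_WATCH_LIST and obs.species_verified

def record_measure(obs: Observation, description: str) -> None:
    """Feedback loop: eradication measures carried out by a CAB are written
    back so they can be searched and visualized next to the observation."""
    obs.measures.append({"date": date.today().isoformat(),
                         "measure": description})
```

The design point the abstract makes is visible here: because every record carries a stable Dyntaxa ID, alerts, field checks and eradication measures all link back to the same taxon even if its name or classification changes over time.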
Mohammed Kamal-Deen Fuseini, Agnes Abah, Andra Waagmeester
Biodiversity is the variety of life on Earth, and it is essential for our planet's health and well-being. Language is also a powerful medium for documenting and preserving cultural heritage, including knowledge about biodiversity. However, many indigenous and underrepresented languages are at risk of disappearing, taking with them valuable information about local ecosystems. In addition, many species are at risk of extinction, and much of our knowledge about biodiversity is held in underrepresented languages (Cardoso et al. 2019). This makes it challenging to document and protect biodiversity, and to share this knowledge with others. Crowdsourcing is a way to collect information from a large number of people, and it can be a valuable tool for documenting biodiversity in underrepresented languages. By crowdsourcing, leveraging the iNaturalist platform and volunteer contributors in the open movement, including the Dagbani*1 and Igbo*2 Wikimedian communities, we can reach people who have knowledge about local biodiversity but who may not have been able to share this knowledge before. For instance, the Dagbani and Igbo Wikimedia contributors had little biodiversity content until they were introduced to the need for it. This can help us fill gaps in our knowledge about biodiversity and protect species that are at risk of extinction. In this presentation, we will discuss the use of crowdsourcing to document biodiversity in underrepresented languages, the challenges and opportunities of using crowdsourcing for this purpose, and some examples of successful projects. We will also discuss the importance of sharing knowledge about biodiversity with others and share some ideas on how to do this. We believe that crowdsourcing has the potential to be a powerful tool for documenting biodiversity in underrepresented languages. By working together, we can help protect our planet's biodiversity and ensure that this knowledge is available to future generations.
{"title":"Documenting Biodiversity in Underrepresented Languages using Crowdsourcing","authors":"Mohammed Kamal-Deen Fuseini, Agnes Abah, Andra Waagmeester","doi":"10.3897/biss.7.112431","DOIUrl":"https://doi.org/10.3897/biss.7.112431","url":null,"abstract":"Biodiversity is the variety of life on Earth, and it is essential for our planet's health and well-being. Language is also a powerful medium for documenting and preserving cultural heritage, including knowledge about biodiversity. However, many indigenous and underrepresented languages are at risk of disappearing, taking with them valuable information about local ecosystems. Also, many species are at risk of extinction, and much of our knowledge about biodiversity is in underrepresented languages. (Cardoso et al. 2019). This can make it challenging to document and protect biodiversity, as well as to share this knowledge with others. Crowdsourcing is a way to collect information from a large number of people, and it can be a valuable tool for documenting biodiversity in underrepresented languages. By crowdsourcing, leveraging the iNaturalist platform, and volunteer contributors in the open movement including the Dagbani*1 and Igbo*2 Wikimedian communities, we can reach people who have knowledge about local biodiversity, but who may not have been able to share this knowledge before. For instance, the Dagbani and Igbo Wikimedia contributors did not have enough content on biodiversity data until they received education about the need. This can help us to fill in the gaps in our knowledge about biodiversity, and to protect species that are at risk of extinction. In this presentation, we will discuss the use of crowdsourcing to document biodiversity in underrepresented languages, the challenges and opportunities of using crowdsourcing for this purpose, and some examples of successful projects. We will also discuss the importance of sharing knowledge about biodiversity with others and share some ideas on how to do this. We believe that crowdsourcing has the potential to be a powerful tool for documenting biodiversity in underrepresented languages. By working together, we can help protect our planet's biodiversity and ensure that this knowledge is available to future generations.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}