The number of openly accessible digital plant specimen images is growing tremendously, and these images are available through data aggregators: the Global Biodiversity Information Facility (GBIF) contains 43.2 million images and Integrated Digitized Biocollections (iDigBio) contains 32.4 million images (accessed 29 June 2023). These images carry rich ecological (morphological, phenological, taxonomic, etc.) information, which has the potential to facilitate large-scale analyses. However, extracting this information from the images and making it available to analysis tools remains challenging and requires more advanced computer vision algorithms. With the latest advancements in natural language processing, it is becoming possible to analyse images with text prompts. For example, the Contrastive Language-Image Pre-Training (CLIP) model, trained on 400 million image-text pairs, can classify everyday images: given an image and a set of candidate text prompts, the model predicts the prompt that best matches the image. We explored the feasibility of using the CLIP model to analyse digital plant specimen images. A particular focus of this study was the generation of appropriate text prompts, which is important because the prompt has a large influence on the model's results. We experimented with three different methods: a) automatic text prompts based on metadata of the specific image or other datasets, b) automatic generic text prompts describing what is in the image, and c) manual text prompts created by annotating the image. We investigated the suitability of these prompts in an experiment that tested whether the CLIP model could recognize a herbarium specimen image, using digital plant specimen images and semantically disparate text prompts. Our ultimate goal is to filter digital plant specimen images based on the availability of intact leaves and a measurement scale, to reduce the number of specimens that reach downstream pipeline steps, for instance the segmentation task in the leaf trait extraction process. To achieve this goal, we are fine-tuning the CLIP model with a dataset of around 20,000 digital plant specimen image-text prompt pairs, where the text prompts were generated using different datasets, metadata and generic text prompt methods. Since the text prompts can be created automatically, the laborious manual annotation process can be eliminated. In conclusion, we present our experimental testing of the CLIP model on digital plant specimen images under varied settings and show how the CLIP model can act as a potential filtering tool. In the future, we plan to investigate the possibility of using text prompts for instance segmentation to extract leaf trait information using Large Language Models (LLMs).
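To make the prompt-scoring idea concrete, the following minimal sketch (our illustration, not the study's pipeline) shows how CLIP ranks candidate text prompts for a single specimen image; it assumes the Hugging Face transformers package and the public openai/clip-vit-base-patch32 checkpoint, and the image path and prompt wording are illustrative assumptions.

```python
# Zero-shot prompt scoring with CLIP: the model embeds the image and each
# candidate prompt, and the softmax over image-text similarities gives one
# probability per prompt.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("specimen.jpg")  # hypothetical digitised specimen image
prompts = [
    "a herbarium specimen with intact leaves and a measurement scale",
    "a herbarium specimen without a measurement scale",
    "a photograph that is not a herbarium specimen",
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.2f}  {prompt}")
```

Under a filtering setup like the one described above, images whose best-scoring prompt mentions intact leaves and a scale bar would be passed on to the segmentation step.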
{"title":"The Role of the CLIP Model in Analysing Herbarium Specimen Images","authors":"Vamsi Krishna Kommineni, Jens Kattge, Jitendra Gaikwad, Susanne Tautenhahn, Birgitta Koenig-ries","doi":"10.3897/biss.7.112566","DOIUrl":"https://doi.org/10.3897/biss.7.112566","url":null,"abstract":"The number of openly-accessible digital plant specimen images is growing tremendously and available through data aggregators: Global Biodiversity Information Facility (GBIF) contains 43.2 million images, and Intergrated Digitized Biocollections (iDigBio) contains 32.4 million images (Accessed on 29.06.2023). All these images contain great ecological (morphological, phenological, taxonomic etc.) information, which has the potential to facilitate the conduct of large-scale analyses. However, extracting this information from these images and making it available to analysis tools remains challenging and requires more advanced computer vision algorithms. With the latest advancements in the natural language processing field, it is becoming possible to analyse images with text prompts. For example, with the Contrastive Language-Image Pre-Training (CLIP) model, which was trained on 400 million image-text pairs, it is feasible to classify day-to-day life images by providing different text prompts and an image as an input to the model, then the model can predict the most suitable text prompt for the input image. We explored the feasibility of using the CLIP model to analyse digital plant specimen images. A particular focus of this study was on the generation of appropriate text prompts. This is important as the prompt has a large influence on the results of the model. We experimented with three different methods: a) automatic text prompt based on metadata of the specific image or other datasets, b) automatic generic text prompt of the image (describing what is in the image) and c) manual text prompt by annotating the image. We investigated the suitability of these prompts with an experiment, where we tested whether the CLIP model could recognize a herbarium specimen image using digital plant specimen images and semantically disparate text prompts. Our ultimate goal is to filter the digital plant specimen images based on the availability of intact leaves and measurement scale to reduce the number of specimens that reach the downstream pipeline, for instance, the segmentation task for the leaf trait extraction process. To achieve the goal, we are fine-tuning the CLIP model with a dataset of around 20,000 digital plant specimen image-text prompt pairs, where the text prompts were generated using different datasets, metadata and generic text prompt methods. Since the text prompts can be created automatically, it is possible to eradicate the laborious manual annotating process. In conclusion, we present our experimental testing of the CLIP model on digital plant specimen images with varied settings and how the CLIP model can act as a potential filtering tool. 
In future, we plan to investigate the possibility of using text prompts to do the instance segmentation to extract leaf trait information using Large Language Models (LLMs).","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135879248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heimo Rainer, Andreas Berger, Tanja Schuster, Johannes Walter, Dieter Reich, Kurt Zernig, Jiří Danihelka, Hana Galušková, Patrik Mráz, Natalia Tkach, Jörn Hentschel, Jochen Müller, Sarah Wagner, Walter Berendsohn, Robert Lücking, Robert Vogt, Lia Pignotti, Francesco Roma-Marzio, Lorenzo Peruzzi
Nomenclatural and taxonomic information is crucial for curating botanical collections. As methods for systematic and taxonomic studies have changed, classification systems have changed considerably over time (Dalla Torre and Harms 1900, Durand and Bentham 1888, Endlicher 1836, Angiosperm Phylogeny Group et al. 2016). Various approaches to storing preserved material have been implemented, most of them based on scientific names (e.g., families, genera, species), often in combination with other criteria such as geographic provenance or collectors. The collection management system JACQ was established in the early 2000s and subsequently developed to support multiple institutions. It features centralised data storage (with mirror sites) and access via the Internet. Participating collections can download their data at any time in comma-separated values (CSV) format. From the beginning, JACQ was conceived as a collaboration platform for objects housed in botanical collections, i.e., plant, fungal and algal groups. For these groups, various sources of taxonomic reference exist; nowadays online resources are preferred, e.g., Catalogue of Life, AlgaeBase, Index Fungorum, MycoBank, Tropicos, Plants of the World Online, International Plant Names Index (IPNI), World Flora Online, Euro+Med, Anthos, Flora of North America, REFLORA, Flora of China, Flora of Cuba, Australian Virtual Herbarium (AVH).

Implementation and (re)use of PIDs
Persistent identifiers (PIDs) for names (at any taxonomic rank), as distinct from PIDs for taxa, are essential to allow and support reliable referencing across institutions and thematic research networks (Agosti et al. 2022). For this purpose we have integrated references to several of the above-mentioned resources and populate the names used inside JACQ with those external PIDs (see the sketch below). For example, Salix rosmarinifolia is accepted in Plants of the World Online, while Euro+Med PlantBase considers it a synonym of Salix repens subsp. rosmarinifolia. Either one can be an identification of a specimen in the JACQ database.

Retrieval of collection material
One strong use case is the curation of material in historic collections. Because outdated taxon concepts were applied to the material in the past, "old" synonyms are omnipresent in historical collections. In order to retrieve all material of a given taxon, it is necessary to know all relevant names.

Future outlook
In combination with the capabilities of Linked Data and IIIF (International Image Interoperability Framework) technology, these PIDs serve as crucial elements for the integration of decentralized information systems and the reuse of (global) taxonomic backbones in combination with collection management systems (Gamer and Kreyenbühl 2022, Hyam 2022, Loh 2017).
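To make the PID enrichment step concrete, here is a minimal sketch (our illustration, not JACQ code) that resolves a list of scientific names against one such external resource, the GBIF backbone taxonomy, via its public species-match API; the CSV file name and column are assumptions.

```python
# Resolve scientific names to GBIF backbone taxon keys, one kind of external
# PID. The input CSV layout stands in for a hypothetical collection export.
import csv
import requests

with open("names.csv", newline="", encoding="utf-8") as f:
    names = [row["scientificName"] for row in csv.DictReader(f)]

for name in names:
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
        timeout=30,
    )
    match = resp.json()
    # usageKey is the backbone taxon key; status distinguishes accepted names
    # from synonyms (a synonym additionally carries an acceptedUsageKey,
    # mirroring the Salix example above).
    print(name, "->", match.get("usageKey"), match.get("status"))
```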
{"title":"Community Curation of Nomenclatural and Taxonomic Information in the Context of the Collection Management System JACQ","authors":"Heimo Rainer, Andreas Berger, Tanja Schuster, Johannes Walter, Dieter Reich, Kurt Zernig, Jiří Danihelka, Hana Galušková, Patrik Mráz, Natalia Tkach, Jörn Hentschel, Jochen Müller, Sarah Wagner, Walter Berendsohn, Robert Lücking, Robert Vogt, Lia Pignotti, Francesco Roma-Marzio, Lorenzo Peruzzi","doi":"10.3897/biss.7.112571","DOIUrl":"https://doi.org/10.3897/biss.7.112571","url":null,"abstract":"Nomenclatural and taxonomic information are crucial for curating botanical collections. In the course of changing methods for systematic and taxonomic studies, classification systems changed considerably over time (Dalla Torre and Harms 1900, Durand and Bentham 1888, Endlicher 1836, Angiosperm Phylogeny Group et al. 2016). Various approaches to store preserved material have been implemented, most of them based on scientific names (e.g., families, genera, species) often in combination with other criteria such as geographic provenance or collectors. The collection management system, JACQ, was established in the early 2000s then developed to support multiple institutions. It features a centralised data storage (with mirror sites) and access via the Internet. Participating collections can download their data at any time in a comma-separated values (CSV) format. From the beginning, JACQ was conceived as a collaboration platform for objects housed in botanical collections, i.e., plant, fungal and algal groups. For these groups, various sources of taxonomic reference exist, nowadays online resources are preferred, e.g., Catalogue of Life, AlgaeBase, Index Fungorum, Mycobank, Tropicos, Plants of the World Online, International Plant Names Index (IPNI), World Flora Online, Euro+Med, Anthos, Flora of Northamerica, REFLORA, Flora of China, Flora of Cuba, Australian Virtual Herbarium (AVH). Implementation and (re)use of PIDs Persistent identifiers (PIDs) for names (at any taxonomic rank) apart from PIDs for taxa, are essential to allow and support reliable referencing across institutions and thematic research networks (Agosti et al. 2022). For this purpose we have integrated referencing to several of the above mentioned resources and populate the names used inside JACQ with those external PIDs. For example, Salix rosmarinifolia is accepted in Plants of the World Online while Euro+Med Plantbase considers it a synonym of Salix repens subsp. rosmarinifolia. Either one can be an identification of a specimen in the JACQ database. Retrieval of collection material One strong use case is the curation of material in historic collections. On the basis of outdated taxon concepts that were applied to the material in history, \"old\" synonyms are omnipresent in historical collections. In order to retrieve all material of a given taxon, it is necessary to know all relevant names. 
Future outlook In combination with the capability of Linked Data and the IIIF (International Image Interoperability Framework) technology, these PIDs serve as crucial elements for the integration of decentralized information systems and reuse of (global) taxonomic backbones in combination with collection management systems (Gamer and Kreyenbühl 2022, Hyam 2022, Loh 2017).","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135879283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro
Understanding the biology underpinning the natural regeneration of plant species in order to plan effective reforestation is a complex task. It can be aided by access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exist widely used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species' reproductive condition and habitat derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive conditions and temporal expressions (Gabud and Batista-Navarro 2018). We then built a new unsupervised hybrid approach for relation extraction (RE) that combines classical rule-based pattern-matching methods with transformer-based language models, framing the RE task as a natural language inference (NLI) task. Using this hybrid approach, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprising a named entity recognition (NER) tool and our hybrid RE tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned on COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold-standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as "fruited heavily" and "mass flowering"). We then use our hybrid RE tool to extract reproductive condition–temporal expression and habitat–geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agriculture and Bioscience International) and show that our work enables the enrichment of descriptive information on the reproductive and habitat conditions of species. This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.
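As an illustration of how RE can be framed as NLI, the following minimal sketch (our construction, not the authors' tool) scores candidate relation hypotheses against a source sentence with a public zero-shot NLI model; the sentence, hypotheses and checkpoint choice are all assumptions.

```python
# Framing relation extraction as natural language inference: each hypothesis
# pairs a reproductive-condition mention with a candidate temporal expression
# from the NER step, and the entailment score ranks the candidate pairs.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = ("Mass flowering was observed in December 1987 "
            "on dry ridges above the river.")

hypotheses = [
    "The flowering happened in December 1987.",
    "The flowering happened on dry ridges.",
]

result = nli(sentence, candidate_labels=hypotheses)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {label}")
```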
{"title":"Extracting Reproductive Condition and Habitat Information from Text Using a Transformer-based Information Extraction Pipeline","authors":"Roselyn Gabud, Nelson Pampolina, Vladimir Mariano, Riza Batista-Navarro","doi":"10.3897/biss.7.112505","DOIUrl":"https://doi.org/10.3897/biss.7.112505","url":null,"abstract":"Understanding the biology underpinning the natural regeneration of plant species in order to make plans for effective reforestation is a complex task. This can be aided by providing access to databases that contain long-term and wide-scale geographical information on species distribution, habitat, and reproduction. Although there exists widely-used biodiversity databases that contain structured information on species and their occurrences, such as the Global Biodiversity Information Facility (GBIF) and the Atlas of Living Australia (ALA), the bulk of knowledge about biodiversity still remains embedded in textual documents. Unstructured information can be made more accessible and useful for large-scale studies if there are tools and services that automatically extract meaningful information from text and store it in structured formats, e.g., open biodiversity databases, ready to be consumed for analysis (Thessen et al. 2022). We aim to enrich biodiversity occurrence databases with information on species reproductive condition and habitat, derived from text. In previous work, we developed unsupervised approaches to extract related habitats and their locations, and related reproductive condition and temporal expressions (Gabud and Batista-Navarro 2018). We built a new unsupervised hybrid approach for relation extraction (RE), which is a combination of classical rule-based pattern-matching methods and transformer-based language models that framed our RE task as a natural language inference (NLI) task. Using our hybrid approach for RE, we were able to extract related biodiversity entities from text even without a large training dataset. In this work, we implement an information extraction (IE) pipeline comprised of a named entity recognition (NER) tool and our hybrid relation extraction (RE) tool. The NER tool is a transformer-based language model that was pretrained on scientific text and then fine-tuned using COPIOUS (Conserving Philippine Biodiversity by Understanding big data; Nguyen et al. 2019), a gold standard corpus containing named entities relevant to species occurrence. We applied the NER tool to automatically annotate geographical location, temporal expression and habitat information contained within sentences. A dictionary-based approach is then used to identify mentions of reproductive conditions in text (e.g., phrases such as \"fruited heavily\" and \"mass flowering\"). We then use our hybrid RE tool to extract reproductive condition - temporal expression and habitat - geographical location entity pairs. We test our IE pipeline on the forestry compendium available in the CABI Digital Library (Centre for Agricultural and Biosciences International), and show that our work enables the enrichment of descriptive information on reproductive and habitat conditions of species. 
This work is a step towards enhancing a biodiversity database with the inclusion of habitat and reproductive condition information extracted from text.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bahadir Altintas, Yasin Bakış, Xiojun Wang, Henry Bart
Artificial Intelligence (AI) is becoming more prevalent in data science as well as in areas of computational science. Commonly used classification methods in AI can also be applied to unorganized databases if a proper model is trained. Most classification work is done on image data for purposes such as object detection and face recognition. If an object is well detected in an image, classification can be used to organize the image data. In this work, we try to identify images from an Integrated Digitized Biocollections (iDigBio) dataset and to classify these images to generate metadata for use as an AI-ready dataset in the future. The main problems with museum image datasets are missing metadata for images, wrong categorization, and poor image quality. By using AI, it may be possible to overcome these problems: automatic tools can help find, eliminate or fix them. For our example, we trained a model for 10 classes (e.g., complete fish, photograph, notes/labels, X-ray, CT (computerized tomography) scan, partial fish, fossil, skeleton) using a manually tagged iDigBio image dataset. After training a model for each class, we reclassified the dataset using these trained models. Some of the results are given in Table 1. As can be seen in the table, even manually classified images can be identified as different classes, and some classes are visually very similar to each other, such as CT scans and X-rays, or fossils and skeletons. Such similarities are confusing for the human eye as well as for AI.
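For readers who want a starting point, the sketch below shows one plausible setup under stated assumptions: fine-tuning a single 10-way image classifier with PyTorch and torchvision. The abstract describes training a model per class; the folder layout, backbone and hyperparameters here are ours, not the authors'.

```python
# Fine-tune an ImageNet-pretrained ResNet-50 on 10 specimen-image classes.
# Assumes images are sorted into one subfolder per class, e.g.
# data/complete_fish, data/xray, data/fossil, ...
import torch
from torch import nn
from torchvision import datasets, models, transforms

tfms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
ds = datasets.ImageFolder("data", transform=tfms)       # 10 class subfolders
loader = torch.utils.data.DataLoader(ds, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(ds.classes))  # new 10-way head

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```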
{"title":"Application of AI-Helped Image Classification of Fish Images: An iDigBio dataset example","authors":"Bahadir Altintas, Yasin Bakış, Xiojun Wang, Henry Bart","doi":"10.3897/biss.7.112438","DOIUrl":"https://doi.org/10.3897/biss.7.112438","url":null,"abstract":"Artificial Intelligence (AI) becomes more prevalent in data science as well as in areas of computational science. Commonly used classification methods in AI can also be used for unorganized databases, if a proper model is trained. Most of the classification work is done on image data for purposes such as object detection and face recognition. If an object is well detected from an image, the classification may be done to organize image data. In this work, we try to identify images from an Integrated Digitized Biocollections (iDigBio) dataset and to classify these images to generate metadata to use as an AI-ready dataset in the future. The main problem of the museum image datasets is the lack of metadata information on images, wrong categorization, or poor image quality. By using AI, it maybe possible to overcome these problems. Automatic tools can help find, eliminate or fix these problems. For our example, we trained a model for 10 classes (e.g., complete fish, photograph, notes/labels, X-ray, CT (computerized tomotography) scan, partial fish, fossil, skeleton) by using a manually tagged iDigBio image dataset. After training a model for each for class, we reclassified the dataset by using these trained models. Some of the results are given in Table 1. As can be seen in the table, even manually classified images can be identified as different classes, and some classes are very similar to each other visually such as CT scans and X-rays or fossils and skeletons. Those kind of similarities are very confusing for the human eye as well as AI results.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rapid accumulation of biodiversity data and the development of deep learning methods bring opportunities for detecting and identifying wild animals automatically, based on artificial intelligence. In this paper, we introduce an AI-based wild animal detection system. It is composed of acoustic and image sensors, network infrastructure, species recognition models, and a data storage and visualization platform, following the technical chain of the Internet of Things (IoT) applied to biodiversity detection. The workflow of the system is as follows:

Deploying sensors for different detection targets. The acoustic sensor is composed of two microphones that pick up sounds from the environment and an edge computing box that judges and sends back the sound files; it is suitable for monitoring birds, mammals, chirping insects and frogs. The image sensor is composed of a high-performance camera that can be controlled to record its surroundings automatically and a video-analysis edge box running a model for detecting and recording animals; it is suitable for monitoring waterbirds in locations without visual obstructions.

Adopting different networks according to signal availability. Network infrastructure is critical for the detection system and for transferring the data collected by the sensors. We use the existing network where 4G/5G signals are available and build dedicated networks using mesh networking technology for areas without signal. Multiple network strategies lower the cost of monitoring jobs.

Recognizing species from sounds, images or videos. AI plays a key role in our system. We have trained acoustic models for more than 800 Chinese birds and some common chirping insects and frogs, which can be identified from sound files recorded by the acoustic sensors. For video and image data, we have trained models for recognizing 1,300 Chinese birds and 400 mammals, which help to discover and count animals captured by the image sensors. Moreover, we propose a special method for detecting species through features of voices, images and niche features of animals; it is a flexible framework that adapts to different combinations of acoustic and image sensors. All models were trained with labeled voices, images and distribution data from the Chinese species database ESPECIES.

Saving and displaying machine observations. The original sound, image and video files, together with the identification results, are stored in a data platform deployed on the cloud for extensible computing and storage. We have developed visualization modules in the platform that display sensors on maps using WebGIS, show curves of the number of records and species per day, raise real-time alerts when sensors capture animals, and report other parameters.
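The following minimal sketch (our illustration, not the deployed system) shows the shape of the edge-side loop from the first step: record a short audio clip, classify it on the device, and upload only confident detections to the central platform. classify_clip() and the ingest endpoint are hypothetical placeholders.

```python
# Edge-side acoustic sensing loop: two-channel recording, local inference,
# and upload of confident detections only, to save bandwidth on weak links.
import time
import requests
import sounddevice as sd
import soundfile as sf

SR = 32_000                 # sample rate in Hz
CLIP_SECONDS = 10
INGEST_URL = "https://example.org/api/detections"   # hypothetical endpoint

def classify_clip(path: str) -> dict:
    # Placeholder: a real deployment runs the on-device recognition model here.
    return {"species": "unknown", "score": 0.0}

while True:
    audio = sd.rec(int(CLIP_SECONDS * SR), samplerate=SR, channels=2)
    sd.wait()                                   # block until recording ends
    sf.write("clip.wav", audio, SR)
    detection = classify_clip("clip.wav")
    if detection.get("score", 0.0) > 0.8:       # send only confident hits
        with open("clip.wav", "rb") as f:
            requests.post(INGEST_URL, files={"audio": f},
                          data={"species": detection["species"]}, timeout=60)
    time.sleep(1)
```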
{"title":"An AI-based Wild Animal Detection System and Its Application","authors":"Congtian Lin, Jiangning Wang, Liqiang Ji","doi":"10.3897/biss.7.112456","DOIUrl":"https://doi.org/10.3897/biss.7.112456","url":null,"abstract":"Rapid accumulation of biodiversity data and development of deep learning methods bring the opportunities for detecting and identifying wild animals automatically, based on artificial intelligence. In this paper, we introduce an AI-based wild animal detection system. It is composed of acoustic and image sensors, network infrastructures, species recognition models, and data storage and visualization platform, which go through the technical chain learned from Internet of Things (IOT) and applied to biodiversity detection. The workflow of the system is as follows: Deploying sensors for different detection targets . The acoustic sensor is composed of two microphones for picking up sounds from the environment and an edge computing box for judging and sending back the sound files. The acoustic sensor is suitable for monitoring birds, mammals, chirping insects and frogs. The image sensor is composed of a high performance camera that can be controlled to record surroundings automatically and a video analysis edge box running a model for detecting and recording animals. The image sensor is suitable for monitoring waterbirds in locations without visual obstructions. Adopting different networks according to signal availability . Network infrastructures are critical for the detection system and the task of transferring data collected by sensors. We use the existing network when 4/5G signals are available, and build special networks using Mesh Networking technology for the areas without signals. Multiple network strategies lower the cost for monitoring jobs. Recognizing species from sounds, images or videos . AI plays a key role in our system. We have trained acoustic models for more than 800 Chinese birds and some common chirping insects and frogs, which can be identified from sound files recorded by acoustic sensors. For video and image data, we also have trained models for recognizing 1300 Chinese birds and 400 mammals, which help to discover and count animals captured by image sensors. Moreover, we propose a special method for detecting species through features of voices, images and niche features of animals. It is a flexible framework to adapt to different combinations of acoustic and image sensors. All models were trained with labeled voices, images and distribution data from Chinese species database, ESPECIES. Saving and displaying machine observations . The original sound, image and video files with identified results were stored in the data platform deployed on the cloud for extensible computing and storage. We have developed visualization modules in the platform for displaying sensors on maps using WebGIS to show curves of the number of records and species for each day, real time alerts from sensors capturing animals, and other parameters. Deploying sensors for different detection targets . 
The acoustic sensor is composed of two microphones for picking up sounds from the environment and an edge computing box for judging and sending back the sound fil","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135981784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Insects account for half of all described living organisms on Earth, with a vast number of species awaiting description. Insects play a major role in ecosystems but are threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens, and large-scale digitization initiatives, such as the digitization street "digitize!" at the Museum für Naturkunde, have recently been undertaken to unlock this data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases and facilitating scientific investigations, the sustainability of the collected data, and efficient knowledge transfer. Despite advancements in high-throughput imaging techniques for specimens and their labels, the process of transcribing label information remains mostly manual and lags behind the pace of digitization efforts. To address this issue, we propose a three-step semi-automated pipeline that focuses on extracting and processing information from individual insect labels. Our solution is primarily designed for printed insect labels, as OCR (optical character recognition) technology performs well on printed text while handwritten text still yields mixed results. The pipeline incorporates computer vision (CV) techniques, OCR, and a clustering algorithm. The initial stage of the pipeline involves image analysis using a convolutional neural network (CNN) model. The model was trained on 2,100 images from three distinct insect label datasets: AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). This first model identifies and isolates single labels within an image, effectively segmenting each label region from the rest of the image, and crops them into multiple new, single-label image files. It also assigns the labels to different classes, i.e., printed or handwritten, so that handwritten labels are sorted out from the printed ones. In the second step, labels classified as "printed" are parsed by an OCR engine to extract the text from the labels. Tesseract and Google Vision OCR were both tested to assess their performance. While Google Vision OCR is a cloud-based service with limited configurability, Tesseract provides the flexibility to fine-tune settings and enhance its performance for our specific use cases. In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm. This step allows the identification and formation of clusters of labels that share identical or highly similar content. Ultimately, these clusters are compared against a curated database of labels and are either assigned to a known label or highlighted as new and manually added to the database. To assess the efficiency of our pipeline, we ran benchmarking experiments using a set of images similar to those the models were trained on, as well as additional image sets from various museum collections. Our pipeline offers several advantages: it streamlines the data entry process and reduces the time and effort of manual extraction, while minimising potential human errors and inconsistencies in label transcription. It promises to accelerate metadata extraction from insect specimens, facilitate scientific research, and enable large-scale analyses that yield deeper insights into these specimens.
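As a concrete illustration of steps two and three (our sketch, not the authors' code), the snippet below OCRs a folder of already-cropped, printed single-label images with Tesseract and then clusters near-identical transcriptions; the folder layout and clustering threshold are assumptions.

```python
# OCR cropped label images, then group labels with near-identical content.
# Assumes pytesseract (with a local Tesseract install), Pillow, scikit-learn.
from pathlib import Path

import pytesseract
from PIL import Image
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [pytesseract.image_to_string(Image.open(p)).strip()
         for p in sorted(Path("labels").glob("*.png"))]

# Character n-gram TF-IDF is tolerant of OCR noise; cosine distances with
# average linkage let clusters form around near-identical transcriptions.
vectors = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(texts)
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6,
    metric="cosine", linkage="average",
).fit_predict(vectors.toarray())

for cluster_id, text in zip(clusters, texts):
    print(cluster_id, text[:60].replace("\n", " "))
```

Each resulting cluster can then be matched against the curated label database, with unmatched clusters flagged for manual review, as described above.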
{"title":"High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi-Automated Pipeline","authors":"Margot Belot, Leonardo Preuss, Joël Tuberosa, Magdalena Claessen, Olha Svezhentseva, Franziska Schuster, Christian Bölling, Théo Léger","doi":"10.3897/biss.7.112466","DOIUrl":"https://doi.org/10.3897/biss.7.112466","url":null,"abstract":"Insects account for half of the total described living organisms on Earth, with a vast number of species awaiting description. Insects play a major role in ecosystems but are yet threatened by habitat destruction, intensive farming, and climate change. Museum collections around the world house millions of insect specimens and large-scale digitization initiatives, such as the digitization street digitize! at the Museum für Naturkunde, have been undertaken recently to unlock this data. Accurate and efficient extraction of insect specimen label information is vital for building comprehensive databases and facilitating scientific investigations, sustainability of the collected data, and efficient knowledge transfer. Despite the advancements in high-throughput imaging techniques for specimens and their labels, the process of transcribing label information remains mostly manual and lags behind the pace of digitization efforts. In order to address this issue, we propose a three step semi-automated pipeline that focuses on extracting and processing information from individual insect labels. Our solution is primarily designed for printed insect labels, as the OCR (optical character recognition) technology performs well for printed text while handwritten texts still yield mixed results. The pipeline incorporates computer vision (CV) techniques, OCR, and a clustering algorithm. The initial stage of our pipeline involves image analysis using a convolutional neural network (CNN) model. The model was trained using 2100 images from three distinct insect label datasets, namely AntWeb (ant specimen labels from various collections), Bees & Bytes (bee specimen labels from the Museum für Naturkunde), and LEP_PHIL (Lepidoptera specimen labels from the Museum für Naturkunde). The first model enables the identification and isolation of single labels within an image, effectively segmenting the label region from the rest of the image, and crops them into multiple new, single-label image files. It also assigns the labels to different classes, i.e., printed text or handwritten, with handwritten labels sorted out from the printed ones. In the second step, labels classified as “printed” are then parsed by an OCR engine to extract the text information from the labels. Tesseract and Google Vision OCRs were both tested to assess their performance. While Google Vision OCR is a cloud-based service with limited configurability, Tesseract provides the flexibility to fine-tune settings and enhance its performance for our specific use cases. In the third step, the OCR outputs are aggregated by similarity using a clustering algorithm. This step allows for the identification and formation of clusters that consist of labels sharing identical or highly similar content. Ultimately, these clusters are compared against a curated database of labels and are assigned to a known label or highlighted as new and manually added to the database. 
In order to assess the efficiency of our pipeline","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135982444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alan Stenhouse, Nicole Fisher, Brendan Lepschi, Alexander Schmidt-Lebuhn, Juanita Rodriguez, Federica Turco, Emma Toms, Andrew Reeson, Cécile Paris, Pete Thrall
Biological collections play a crucial role in our understanding of biodiversity and inform research in areas such as biosecurity, conservation, human health and climate change. In recent years, the digitisation of biological specimen collections has emerged as a vital mechanism for preserving and facilitating access to these invaluable scientific datasets. However, the growing volume of specimens and associated data presents significant challenges for curation and data management. By leveraging human-Artificial Intelligence (AI) collaborations, we aim to transform the way biological collections are curated and managed, unlocking their full potential in addressing global challenges. We present our initial contribution to this field through the development of a software prototype to improve metadata extraction from digital specimen images in biological collections. The prototype provides an easy-to-use platform for collaborating with web-based AI services, such as Google Vision and OpenAI's Generative Pre-trained Transformer (GPT) Large Language Models (LLMs). We demonstrate its effectiveness when applied to herbarium and insect specimen images. Machine-human collaboration may occur at various points within the workflows and can significantly affect outcomes. Initial trials suggest that the visual display of AI model uncertainty could be useful during expert data curation. While much work remains to be done, our results indicate that collaboration between humans and AI models can significantly improve the digitisation rate of biological specimens and thereby enable faster global access to this vital data. Finally, we introduce our broader vision for improving biological collection curation and management using human-AI collaborative methods. We explore the rationale behind this approach and the potential benefits of adding AI-based assistants to collection teams. We also examine future possibilities and the concept of creating 'digital colleagues' for seamless collaboration between human and digital curators. This 'collaborative intelligence' will enable us to make better use of both human and machine capabilities to achieve the goal of unlocking and improving our use of these vital biodiversity data to tackle real-world problems.
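As one example of what such a human-AI step can look like (our sketch, not the prototype), the snippet below sends raw OCR text from a specimen label to an LLM and asks for structured, Darwin Core-style fields back; the openai v1 client usage is real, but the model choice and label text are illustrative assumptions.

```python
# Turn raw OCR text from a specimen label into structured metadata fields.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ocr_text = "Eucalyptus pauciflora Cotter Rd 12 Jan 1964 leg. J. Smith"  # invented

response = client.chat.completions.create(
    model="gpt-4o-mini",                         # assumed model choice
    response_format={"type": "json_object"},     # force parseable JSON output
    messages=[{
        "role": "user",
        "content": ("Extract scientificName, locality, eventDate and "
                    "recordedBy as a JSON object from this herbarium "
                    "label text:\n" + ocr_text),
    }],
)
print(json.loads(response.choices[0].message.content))
```

In a collaborative workflow, fields returned with low model confidence would be highlighted for the human curator rather than written directly to the collection database.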
{"title":"Improving Biological Collections Data through Human-AI Collaboration","authors":"Alan Stenhouse, Nicole Fisher, Brendan Lepschi, Alexander Schmidt-Lebuhn, Juanita Rodriguez, Federica Turco, Emma Toms, Andrew Reeson, Cécile Paris, Pete Thrall","doi":"10.3897/biss.7.112488","DOIUrl":"https://doi.org/10.3897/biss.7.112488","url":null,"abstract":"Biological collections play a crucial role in our understanding of biodiversity and inform research in areas such as biosecurity, conservation, human health and climate change. In recent years, the digitisation of biological specimen collections has emerged as a vital mechanism for preserving and facilitating access to these invaluable scientific datasets. However, the growing volume of specimens and associated data presents significant challenges for curation and data management. By leveraging human-Artificial Intelligence (AI) collaborations, we aim to transform the way biological collections are curated and managed, unlocking their full potential in addressing global challenges. We present our initial contribution to this field through the development of a software prototype to improve metadata extraction from digital specimen images in biological collections. The prototype provides an easy-to-use platform for collaborating with web-based AI services, such as Google Vision and OpenAI Generative Pre-trained Transformer (GPT) Large Language Models (LLM). We demonstrate its effectiveness when applied to herbarium and insect specimen images. Machine-human collaboration may occur at various points within the workflows and can significantly affect outcomes. Initial trials suggest that the visual display of AI model uncertainty could be useful during expert data curation. While much work remains to be done, our results indicate that collaboration between humans and AI models can significantly improve the digitisation rate of biological specimens and thereby enable faster global access to this vital data. Finally, we introduce our broader vision for improving biological collection curation and management using human-AI collaborative methods. We explore the rationale behind this approach and the potential benefits of adding AI-based assistants to collection teams. We also examine future possibilities and the concept of creating 'digital colleagues' for seamless collaboration between human and digital curators. This ‘collaborative intelligence’ will enable us to make better use of both human and machine capabilities to achieve the goal of unlocking and improving our use of these vital biodiversity data to tackle real-world problems.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margret Steinthorsdottir, Veronika Johansson, Manash Shah
The Swedish Biodiversity Data Infrastructure (SBDI) is a biodiversity informatics infrastructure and the key national resource for data-driven biodiversity and ecosystems research. SBDI rests on three pillars: mobilisation of and access to biodiversity data; development and operation of tools for analysing these data; and user support. SBDI is funded by the Swedish Research Council (VR) and eleven of Sweden's major universities and government research authorities (Fig. 1). SBDI was formed in early 2021 and represents the final step in an amalgamation of national infrastructures for biodiversity and ecosystems research. SBDI includes the Swedish node of the Global Biodiversity Information Facility (GBIF), the key international infrastructure for sharing biodiversity data. SBDI's predecessor, Biodiversity Atlas Sweden (BAS), was an early adopter of the Atlas of Living Australia (ALA) platform. SBDI pioneered container-based deployment of the platform using Docker and Docker Swarm. This container-based approach helps simplify deployment of the platform, which is characterised by a microservice architecture with loosely coupled services. This enables scalability, modularity, integration of services, and new technology insertions. SBDI has customised the BioCollect module to remove region-specific constraints so that it can be more readily improved for environmental monitoring in Sweden. To further support this, there are plans to develop services for the distribution of terrestrial map layers, which will provide important habitat information for artificial intelligence and machine learning research projects. The Amplicon Sequence Variant (ASV) portal, an interface to sequence-based observations, is an example of integration and new technology insertion. The portal, developed in SBDI and seamlessly integrated with the ALA platform, provides basic functionality for searching ASVs and occurrence records using the Basic Local Alignment Search Tool (BLAST) or filters on sequencing details and taxonomy, and for submitting metabarcoding datasets (Fig. 2). Future developments for SBDI include a continued focus on eDNA and monitoring data as well as the implementation of procedures for handling sensitive data.
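To give a flavour of the sequence search that the ASV portal exposes, here is a minimal sketch of a local BLAST query using the standard blastn command line (assumed installed); the file and database names are hypothetical stand-ins, not the portal's internals.

```python
# Run blastn against a pre-built ASV reference database and print the hits.
import subprocess

result = subprocess.run(
    ["blastn",
     "-query", "query.fasta",            # FASTA with the ASV sequence(s)
     "-db", "asv_sequences",             # hypothetical reference database
     "-outfmt", "6 qseqid sseqid pident length evalue",
     "-max_target_seqs", "5"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    qseqid, sseqid, pident, length, evalue = line.split("\t")
    print(f"{qseqid} -> {sseqid}: {pident}% identity, e={evalue}")
```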
{"title":"Swedish Biodiversity Data Infrastructure (SBDI): Insights from the Swedish ALA installation","authors":"Margret Steinthorsdottir, Veronika Johansson, Manash Shah","doi":"10.3897/biss.7.112429","DOIUrl":"https://doi.org/10.3897/biss.7.112429","url":null,"abstract":"The Swedish Biodiversity Data Infrastructure (SBDI) is a biodiversity informatics infrastructure and is the key national resource for data-driven biodiversity and ecosystems research. SBDI rests on three pillars: mobilisation and access to biodiversity data; development and operation of tools for analysing these data; and user support. SBDI is funded by the Swedish Research Council (VR) and eleven of Sweden’s major universities and research government authorities (Fig. 1). mobilisation and access to biodiversity data; development and operation of tools for analysing these data; and user support. SBDI is funded by the Swedish Research Council (VR) and eleven of Sweden’s major universities and research government authorities (Fig. 1). SBDI was formed in early 2021 and represents the final step in an amalgamation of national infrastructures for biodiversity and ecosystems research. SBDI includes the Swedish node of the Global Biodiversity Information Facility (GBIF), the key international infrastructure for sharing biodiversity data. SBDI's predecessor Biodiversity Atlas Sweden (BAS) was an early adopter of the Atlas of Living Australia (ALA) platform. SBDI pioneered the container-based deployment of the platform using Docker and Docker Swarm. This container-based approach helps simplify deployment of the platform, which is characterised by a microservice architecture with loosely coupled services. This enables scalability, modularity, integration of services, and new technology insertions. SBDI has customised the BioCollect module to remove region-specific constraints so that it can be more readily improved for environmental monitoring in Sweden. To further support this, there are plans to develop services for the distribution of terrestrial map layers, which will provide important habitat information for artificial intelligence and machine learning research projects. The Amplicon Sequence Variants (ASVs) portal, an interface to sequence-based observations, is an example of integration and new technology insertion. The portal developed in SBDI and seamlessly integrated with the ALA platform provides basic functionalities for searching ASVs and occurrence records using the Basic Local Alignment Search Tool (BLAST) or filters on sequencing details and taxonomy and for submitting metabarcoding dataset Fig. 2. Future developments for SBDI include a continued focus on eDNA and monitoring data as well as the implementation of procedures for handling sensitive data.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135982102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mora Aronsson, Malin Strand, Holger Dettki, Hanna Illander, Johan Olsson
The SLU Swedish Species Information Centre (SSIC, SLU Artdatabanken) accumulates, analyses and disseminates information concerning species and habitats occurring in Sweden. The SSIC provides an open-access biodiversity reporting and analysis infrastructure, including the Swedish Species Observation System, the Swedish taxonomic backbone Dyntaxa, and tools for species information covering traits, terminology, quality assurance and species identification. The content is available to scientists, conservationists and the public. All systems, databases, APIs and web applications rely on recognized standards to ensure interoperability. The SSIC is a leading partner within the Swedish Biodiversity Data Infrastructure (SBDI). Here we present a data flow (Fig. 1) that exemplifies the strengthening of cooperation and the transfer of experience between research, community, non-governmental organizations (NGOs), citizen science and governmental agencies, and that also presents solutions to current data challenges (e.g., data fragmentation, taxonomic issues or platform relations). This data flow aims to facilitate the process of evaluating and understanding the distribution and spread of species (e.g., invasive alien species). It provides Findable, Accessible, Interoperable and Reusable (FAIR) data and links related information between different parties such as universities, NGOs, county administrative boards (CABs) and environmental protection agencies (EPAs). The digital structure is built on the national Swedish taxonomic backbone Dyntaxa, which prevents data fragmentation due to taxonomic issues and acts as a common standard for all users. The chain of information comprises systems, tools and a linked data flow for reporting observations and verification procedures, and it can work as an early-warning system for surveillance of certain species. After an observation is reported, an alert can be activated, field checks can be carried out, and, if necessary, eradication measures can be initiated. The verification tool, which traditionally focused on the quality of species identification, has been improved to also verify geographic precision; for eradication actions this is as important as species accuracy. A digital catalogue of eradication methods is in use by the CABs, but there are also recommendations on methods for 'public' use, and collaboration between Invasive Alien Species (IAS) coordinators in regional CABs is currently being developed. The CABs have a separate tool for documenting eradication measures and, if/when measures are carried out (by CABs), this information can be fed back from the CAB tool into the database at SSIC, where it is possible to search for, and visualize, this information. Taxonomic integrity over time should remain intact and tied to the taxon identifier (ID) provided by Dyntaxa. However, metadata, such as geographic position, date, verification status, mitigation results, etc., will be fully us
{"title":"Practice, Pathways and Lessons Learned from Building a Digital Data Flow with Tools: Focusing on alien invasive species, from occurrence via measures to documentation","authors":"Mora Aronsson, Malin Strand, Holger Dettki, Hanna Illander, Johan Olsson","doi":"10.3897/biss.7.112337","DOIUrl":"https://doi.org/10.3897/biss.7.112337","url":null,"abstract":"The SLU Swedish Species Information Centre (SSIC, SLU Artdatabanken) accumulates, analyses and disseminates information concerning species and habitats occurring in Sweden. The SSIC provides an open access biodiversity reporting and analysis infrastructure including the Swedish Species Observation System, the Swedish taxonomic backbone Dyntaxa, and tools for species information including traits, terminology, quality assurance and species identification.*1 The content is available to scientists, conservationists and the public. All systems, databases, APIs and web applications, rely on recognized standards to ensure interoperability. The SSIC is a leading partner within the Swedish Biodiversity Data Infrastructure (SBDI). Here we present a data flow (Fig. 1) that exemplifies the strengthening of the cooperation and transfer of experiences between research, community, non-governmental organizations (NGOs), citizen science and governmental agencies, and also presents solutions to current data challenges (e.g., data fragmentation, taxonomic issues or platform relations). This data flow aimed to facilitate the process for evaluating and understanding the distribution and spread of species (e.g., invasive alien species). It provides Findable, Accessible, Interoperable and Reusable (FAIR) data and links related information between different parties such as universities, NGOs, county administrative boards (CABs) and environmental protection agencies (EPAs). The digital structure is built on the national Swedish taxonomic backbone Dyntaxa, which prevents data fragmentation due to taxonomic issues and acts as a common standard for all users. The chain of information contains systems, tools and a linked data flow for reporting observations, verification procedures, and it can work as an early warning system for surveillance regarding certain species. After an observation is reported, an alert can be activated, field checks can be carried out, and if necessary, eradication measures can be activated. The verification tool that traditionally has been focused on the quality of species identification has been improved, providing verification of geographic precision. This is equally important for eradication actions as is species accuracy. A digital catalogue of eradication methods is in use by the CABs but there are also recommendations on methods for ‘public’ use, and collaboration between Invasive Alien Species (IAS) coordinators in regional CABs is currently being developed. The CABs have a separate tool for documentation of eradication measures and, if/when measures are carried out (by CABs), this information can be fed back from the CAB-tool into the database in SSIC where it is possible to search for, and visualize, this information. Taxonomic integrity over time should be intact and related to the taxon identifier (ID) provided by Dyntaxa. 
However, metadata, such as geographic position, date, verification status, mitigation results, etc., will be fully us","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135981960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
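The report-verify-alert-measure chain described above is essentially an event pipeline keyed on Dyntaxa taxon IDs. The following is a minimal sketch of how such a chain might be modeled; the class names, the watch-list contents, and the uncertainty threshold are hypothetical illustrations, not the SSIC's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical watch list of Dyntaxa taxon IDs flagged as invasive;
# real IDs are issued by the SSIC taxonomic backbone.
IAS_WATCH_LIST = {233833}

@dataclass
class Observation:
    dyntaxa_id: int             # taxon ID from the Dyntaxa backbone
    latitude: float
    longitude: float
    coord_uncertainty_m: float  # geographic precision of the record
    observed_on: date
    species_verified: bool = False
    geography_verified: bool = False
    measures: list = field(default_factory=list)

def verify(obs: Observation, max_uncertainty_m: float = 100.0) -> bool:
    """Verification covers geographic precision as well as species identity."""
    obs.species_verified = True  # stands in for an expert or automated check
    obs.geography_verified = obs.coord_uncertainty_m <= max_uncertainty_m
    return obs.species_verified and obs.geography_verified

def needs_alert(obs: Observation) -> bool:
    """Early-warning trigger: a verified record of a watched taxon."""
    return obs.dyntaxa_id in IAS_WATCH_LIST and obs.species_verified

def record_measure(obs: Observation, description: str) -> None:
    """Feedback loop: eradication measures carried out by a CAB are written
    back so they can be searched and visualized next to the observation."""
    obs.measures.append({"date": date.today().isoformat(),
                         "measure": description})
```

The design point the abstract makes is visible here: because every record carries a stable Dyntaxa ID, alerts, field checks and eradication measures all link back to the same taxon even if its name or classification changes over time.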
Mohammed Kamal-Deen Fuseini, Agnes Abah, Andra Waagmeester
Biodiversity is the variety of life on Earth, and it is essential for our planet's health and well-being. Language is also a powerful medium for documenting and preserving cultural heritage, including knowledge about biodiversity. However, many indigenous and underrepresented languages are at risk of disappearing, taking with them valuable information about local ecosystems. In addition, many species are at risk of extinction, and much of our knowledge about biodiversity is held in underrepresented languages (Cardoso et al. 2019). This makes it challenging to document and protect biodiversity, and to share this knowledge with others. Crowdsourcing is a way to collect information from a large number of people, and it can be a valuable tool for documenting biodiversity in underrepresented languages. By crowdsourcing, leveraging the iNaturalist platform and volunteer contributors in the open movement, including the Dagbani*1 and Igbo*2 Wikimedian communities, we can reach people who have knowledge about local biodiversity but who may not have been able to share this knowledge before. For instance, the Dagbani and Igbo Wikimedia contributors had little biodiversity content until they were introduced to the need for it. This can help us fill gaps in our knowledge about biodiversity and protect species that are at risk of extinction. In this presentation, we will discuss the use of crowdsourcing to document biodiversity in underrepresented languages, the challenges and opportunities of using crowdsourcing for this purpose, and some examples of successful projects. We will also discuss the importance of sharing knowledge about biodiversity with others and share some ideas on how to do this. We believe that crowdsourcing has the potential to be a powerful tool for documenting biodiversity in underrepresented languages. By working together, we can help protect our planet's biodiversity and ensure that this knowledge is available to future generations.
{"title":"Documenting Biodiversity in Underrepresented Languages using Crowdsourcing","authors":"Mohammed Kamal-Deen Fuseini, Agnes Abah, Andra Waagmeester","doi":"10.3897/biss.7.112431","DOIUrl":"https://doi.org/10.3897/biss.7.112431","url":null,"abstract":"Biodiversity is the variety of life on Earth, and it is essential for our planet's health and well-being. Language is also a powerful medium for documenting and preserving cultural heritage, including knowledge about biodiversity. However, many indigenous and underrepresented languages are at risk of disappearing, taking with them valuable information about local ecosystems. Also, many species are at risk of extinction, and much of our knowledge about biodiversity is in underrepresented languages. (Cardoso et al. 2019). This can make it challenging to document and protect biodiversity, as well as to share this knowledge with others. Crowdsourcing is a way to collect information from a large number of people, and it can be a valuable tool for documenting biodiversity in underrepresented languages. By crowdsourcing, leveraging the iNaturalist platform, and volunteer contributors in the open movement including the Dagbani*1 and Igbo*2 Wikimedian communities, we can reach people who have knowledge about local biodiversity, but who may not have been able to share this knowledge before. For instance, the Dagbani and Igbo Wikimedia contributors did not have enough content on biodiversity data until they received education about the need. This can help us to fill in the gaps in our knowledge about biodiversity, and to protect species that are at risk of extinction. In this presentation, we will discuss the use of crowdsourcing to document biodiversity in underrepresented languages, the challenges and opportunities of using crowdsourcing for this purpose, and some examples of successful projects. We will also discuss the importance of sharing knowledge about biodiversity with others and share some ideas on how to do this. We believe that crowdsourcing has the potential to be a powerful tool for documenting biodiversity in underrepresented languages. By working together, we can help protect our planet's biodiversity and ensure that this knowledge is available to future generations.","PeriodicalId":9011,"journal":{"name":"Biodiversity Information Science and Standards","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135980578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}