Automated annotation of scientific texts for ML-based keyphrase extraction and validation.

IF 3.4 · CAS Zone 4 (Biology) · JCR Q1 (Mathematical & Computational Biology)
Database: The Journal of Biological Databases and Curation
Pub Date: 2024-09-27 · DOI: 10.1093/database/baae093
Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan
{"title":"Automated annotation of scientific texts for ML-based keyphrase extraction and validation.","authors":"Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan","doi":"10.1093/database/baae093","DOIUrl":null,"url":null,"abstract":"<p><p>Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by a ML keyword extraction algorithm.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database: The Journal of Biological Databases and Curation","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/database/baae093","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
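The second approach, matching unlabeled texts against a domain-specific controlled vocabulary and then checking agreement with ML-extracted keyphrases, can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary terms, example document, and the `ml_keyphrases` values below are hypothetical placeholders, and a real application would presumably load an ontology (e.g. an environmental ontology such as ENVO) rather than a hand-written term set.

```python
"""Minimal sketch (assumed, not from the paper) of controlled-vocabulary
label assignment and its comparison with ML-extracted keyphrases."""

import re

# Illustrative controlled vocabulary of environmental genomics terms;
# a real pipeline would load these from an ontology file instead.
CONTROLLED_VOCABULARY = {
    "soil metagenome",
    "rhizosphere",
    "16S rRNA",
    "nitrogen fixation",
    "microbial community",
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so term matching ignores case and spacing."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def assign_vocabulary_labels(document: str, vocabulary: set[str]) -> set[str]:
    """Label a document with every vocabulary term that occurs in its text."""
    doc = normalize(document)
    return {term for term in vocabulary if normalize(term) in doc}


def label_agreement(assigned: set[str], ml_keyphrases: set[str]) -> float:
    """Fraction of vocabulary-assigned labels also proposed by the ML extractor."""
    if not assigned:
        return 0.0
    ml_norm = {normalize(k) for k in ml_keyphrases}
    matched = {label for label in assigned if normalize(label) in ml_norm}
    return len(matched) / len(assigned)


if __name__ == "__main__":
    document = (
        "We profiled the microbial community of the rhizosphere using 16S rRNA "
        "amplicon sequencing and assembled a soil metagenome."
    )
    # Placeholder output of some ML keyphrase extraction model.
    ml_keyphrases = {"rhizosphere", "soil metagenome", "amplicon sequencing"}

    labels = assign_vocabulary_labels(document, CONTROLLED_VOCABULARY)
    print("Assigned labels:", sorted(labels))
    print("Agreement with ML keyphrases:", label_agreement(labels, ml_keyphrases))
```

The sketch uses simple substring matching for clarity; the paper's ontology-based pipeline may well use more robust term recognition (stemming, synonym expansion, concept IDs), but the overall validation idea, assigning labels from curated terminology and measuring overlap with ML-suggested keyphrases, is the same.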

Source journal: Database: The Journal of Biological Databases and Curation (Mathematical & Computational Biology)
CiteScore: 9.00
Self-citation rate: 3.40%
Publication volume: 100
Review time: >12 weeks
Journal description: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.
Latest articles from this journal:
- GeniePool 2.0: advancing variant analysis through CHM13-T2T, AlphaMissense, gnomAD V4 integration, and variant co-occurrence queries.
- AneRBC dataset: a benchmark dataset for computer-aided anemia diagnosis using RBC images.
- MiCK: a database of gut microbial genes linked with chemoresistance in cancer patients.
- JTIS: enhancing biomedical document-level relation extraction through joint training with intermediate steps.
- scEccDNAdb: an integrated single-cell eccDNA resource for human and mouse.