Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition

IF 3.8, CAS Region 2 (Biology), Q1 Biochemical Research Methods, Journal of Proteome Research, Pub Date: 2024-05-11, DOI: 10.1021/acs.jproteome.3c00367
Meiqi Wang, Avish Vijayaraghavan, Tim Beck* and Joram M. Posma*
{"title":"Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition","authors":"Meiqi Wang,&nbsp;Avish Vijayaraghavan,&nbsp;Tim Beck* and Joram M. Posma*,&nbsp;","doi":"10.1021/acs.jproteome.3c00367","DOIUrl":null,"url":null,"abstract":"<p >Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in &gt;4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an <i>F</i>1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for <i>F</i>1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of <i>F</i>1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.</p>","PeriodicalId":48,"journal":{"name":"Journal of Proteome Research","volume":null,"pages":null},"PeriodicalIF":3.8000,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.acs.org/doi/epdf/10.1021/acs.jproteome.3c00367","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Proteome Research","FirstCategoryId":"99","ListUrlMain":"https://pubs.acs.org/doi/10.1021/acs.jproteome.3c00367","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
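The annotation approach described above pairs dictionary matching with rule-based keyword searching. The snippet below is a minimal illustrative sketch of that general idea, assuming a toy dictionary and a crude "-ase" suffix rule; the actual pipeline in the linked repository uses curated enzyme vocabularies and more elaborate rules, and achieves far higher precision than this sketch would.

```python
import re

# Toy dictionary; the published pipeline draws on curated enzyme
# vocabularies that are not reproduced here (illustrative assumption).
ENZYME_DICTIONARY = {"alcohol dehydrogenase", "dna polymerase", "lysozyme"}

# Rule-based keyword: many enzyme names end in "-ase". This suffix rule is a
# deliberately crude stand-in for the paper's rules and will over-match
# words such as "phase" or "database".
ASE_PATTERN = re.compile(r"\b[A-Za-z][A-Za-z0-9-]*ases?\b")

def annotate_enzymes(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, surface form) spans of candidate enzyme mentions."""
    spans = []
    lowered = text.lower()
    # Dictionary matching: case-insensitive lookup of known enzyme names.
    for name in ENZYME_DICTIONARY:
        start = lowered.find(name)
        while start != -1:
            spans.append((start, start + len(name), text[start:start + len(name)]))
            start = lowered.find(name, start + 1)
    # Rule-based search: "-ase"/"-ases" tokens not already inside a dictionary span.
    for m in ASE_PATTERN.finditer(text):
        if not any(s <= m.start() < e for s, e, _ in spans):
            spans.append((m.start(), m.end(), m.group(0)))
    return sorted(spans)

print(annotate_enzymes("Alcohol dehydrogenase and a novel kinase were assayed."))
# -> [(0, 21, 'Alcohol dehydrogenase'), (34, 40, 'kinase')]
```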
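For the transformer-based NER models, the abstract contrasts BioBERT and SciBERT vocabularies. The following is a minimal sketch, assuming the Hugging Face transformers library, a BIO label scheme with a single ENZYME type, and the public dmis-lab/biobert-base-cased-v1.1 and allenai/scibert_scivocab_uncased checkpoints as stand-ins for the bases compared in the paper; the BiLSTM variants and the full fine-tuning recipe are not reproduced here.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO scheme with a single entity type; the label names are an assumption
# for illustration, not necessarily the scheme used in the paper.
labels = ["O", "B-ENZYME", "I-ENZYME"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

# Public checkpoints assumed as stand-ins for the BioBERT/SciBERT bases
# compared in the study; swap one for the other to change the vocabulary.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # or "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

# Encode a sentence; during fine-tuning, word_ids() from the fast tokenizer is
# typically used to align BIO labels to sub-word pieces before training with
# the standard token-classification recipe (e.g. the Trainer API).
encoding = tokenizer(
    "Alcohol dehydrogenase catalyses the oxidation of ethanol.",
    return_tensors="pt", truncation=True,
)
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, len(labels))
```

Evaluating such a model on the 526 manually annotated full-text publications would then yield the entity-level precision, recall and F1 figures reported in the abstract.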

Source Journal

Journal of Proteome Research (Biology – Biochemical Research Methods)
CiteScore: 9.00
Self-citation rate: 4.50%
Articles published: 251
Review time: 3 months

Journal description: Journal of Proteome Research publishes content encompassing all aspects of global protein analysis and function, including the dynamic aspects of genomics, spatio-temporal proteomics, metabonomics and metabolomics, clinical and agricultural proteomics, as well as advances in methodology including bioinformatics. The theme and emphasis is on a multidisciplinary approach to the life sciences through the synergy between the different types of "omics".
Latest Articles in This Journal

Exploring Infantile Epileptic Spasm Syndrome: A Proteomic Analysis of Plasma Using the Data-Independent Acquisition Approach.
Chronic Exposure to Petroleum-Derived Hydrocarbons Alters Human Skin Microbiome and Metabolome Profiles: A Pilot Study.
Unveiling Pathophysiological Insights: Serum Metabolic Dysregulation in Acute Respiratory Distress Syndrome Patients with Acute Kidney Injury.
Valuable Contributions and Lessons Learned from Proteomics and Metabolomics Studies of COVID-19.
Characteristics of Myocardial Structure and Central Carbon Metabolism during the Early and Compensatory Stages of Cardiac Hypertrophy.