Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition

IF 3.8, CAS Region 2 (Biology), Q1 Biochemical Research Methods, Journal of Proteome Research, Pub Date: 2024-05-11, DOI: 10.1021/acs.jproteome.3c00367
Meiqi Wang, Avish Vijayaraghavan, Tim Beck* and Joram M. Posma*
{"title":"Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition","authors":"Meiqi Wang,&nbsp;Avish Vijayaraghavan,&nbsp;Tim Beck* and Joram M. Posma*,&nbsp;","doi":"10.1021/acs.jproteome.3c00367","DOIUrl":null,"url":null,"abstract":"<p >Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in &gt;4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an <i>F</i>1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for <i>F</i>1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of <i>F</i>1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.</p>","PeriodicalId":48,"journal":{"name":"Journal of Proteome Research","volume":null,"pages":null},"PeriodicalIF":3.8000,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.acs.org/doi/epdf/10.1021/acs.jproteome.3c00367","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Proteome Research","FirstCategoryId":"99","ListUrlMain":"https://pubs.acs.org/doi/10.1021/acs.jproteome.3c00367","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
Citations: 0

Abstract

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
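The annotation approach described above pairs dictionary matching with rule-based keyword searching. The snippet below is a minimal illustrative sketch of that general idea, assuming a toy dictionary and a crude "-ase" suffix rule; the actual pipeline in the linked repository uses curated enzyme vocabularies and more elaborate rules, and achieves far higher precision than this sketch would.

```python
import re

# Toy dictionary; the published pipeline draws on curated enzyme
# vocabularies that are not reproduced here (illustrative assumption).
ENZYME_DICTIONARY = {"alcohol dehydrogenase", "dna polymerase", "lysozyme"}

# Rule-based keyword: many enzyme names end in "-ase". This suffix rule is a
# deliberately crude stand-in for the paper's rules and will over-match
# words such as "phase" or "database".
ASE_PATTERN = re.compile(r"\b[A-Za-z][A-Za-z0-9-]*ases?\b")

def annotate_enzymes(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, surface form) spans of candidate enzyme mentions."""
    spans = []
    lowered = text.lower()
    # Dictionary matching: case-insensitive lookup of known enzyme names.
    for name in ENZYME_DICTIONARY:
        start = lowered.find(name)
        while start != -1:
            spans.append((start, start + len(name), text[start:start + len(name)]))
            start = lowered.find(name, start + 1)
    # Rule-based search: "-ase"/"-ases" tokens not already inside a dictionary span.
    for m in ASE_PATTERN.finditer(text):
        if not any(s <= m.start() < e for s, e, _ in spans):
            spans.append((m.start(), m.end(), m.group(0)))
    return sorted(spans)

print(annotate_enzymes("Alcohol dehydrogenase and a novel kinase were assayed."))
# -> [(0, 21, 'Alcohol dehydrogenase'), (34, 40, 'kinase')]
```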
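For the transformer-based NER models, the abstract contrasts BioBERT and SciBERT vocabularies. The following is a minimal sketch, assuming the Hugging Face transformers library, a BIO label scheme with a single ENZYME type, and the public dmis-lab/biobert-base-cased-v1.1 and allenai/scibert_scivocab_uncased checkpoints as stand-ins for the bases compared in the paper; the BiLSTM variants and the full fine-tuning recipe are not reproduced here.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO scheme with a single entity type; the label names are an assumption
# for illustration, not necessarily the scheme used in the paper.
labels = ["O", "B-ENZYME", "I-ENZYME"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

# Public checkpoints assumed as stand-ins for the BioBERT/SciBERT bases
# compared in the study; swap one for the other to change the vocabulary.
checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # or "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(labels), id2label=id2label, label2id=label2id
)

# Encode a sentence; during fine-tuning, word_ids() from the fast tokenizer is
# typically used to align BIO labels to sub-word pieces before training with
# the standard token-classification recipe (e.g. the Trainer API).
encoding = tokenizer(
    "Alcohol dehydrogenase catalyses the oxidation of ethanol.",
    return_tensors="pt", truncation=True,
)
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, len(labels))
```

Evaluating such a model on the 526 manually annotated full-text publications would then yield the entity-level precision, recall and F1 figures reported in the abstract.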

Source Journal

Journal of Proteome Research (Biology – Biochemical Research Methods)
CiteScore: 9.00
Self-citation rate: 4.50%
Articles published: 251
Review time: 3 months

Journal description: Journal of Proteome Research publishes content encompassing all aspects of global protein analysis and function, including the dynamic aspects of genomics, spatio-temporal proteomics, metabonomics and metabolomics, clinical and agricultural proteomics, as well as advances in methodology including bioinformatics. The theme and emphasis is on a multidisciplinary approach to the life sciences through the synergy between the different types of "omics".
Latest Articles in This Journal

Exploring Infantile Epileptic Spasm Syndrome: A Proteomic Analysis of Plasma Using the Data-Independent Acquisition Approach.
Chronic Exposure to Petroleum-Derived Hydrocarbons Alters Human Skin Microbiome and Metabolome Profiles: A Pilot Study.
Unveiling Pathophysiological Insights: Serum Metabolic Dysregulation in Acute Respiratory Distress Syndrome Patients with Acute Kidney Injury.
Valuable Contributions and Lessons Learned from Proteomics and Metabolomics Studies of COVID-19.
Characteristics of Myocardial Structure and Central Carbon Metabolism during the Early and Compensatory Stages of Cardiac Hypertrophy.