CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes.

Bioinformatics advances (IF 2.8; JCR Q2, Mathematical & Computational Biology). Pub Date: 2024-08-20. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae116

Katerina Nastou, Mikaela Koutrouli, Sampo Pyysalo, Lars Juhl Jensen


Citations: 0

Abstract




Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.

Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature.

Availability and implementation: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
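
The F-scores quoted in the Results can be made concrete with a minimal sketch of entity-level evaluation. It assumes exact span matching, where a predicted entity counts as correct only if its document, offsets, and type all match a gold annotation. The abstract does not state the matching criterion the authors used, so this is an illustration and not their evaluation code. The `Span` tuple layout, the `span_f_score` function name, and the toy annotations are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span format: (document id, start offset, end offset, entity type).
Span = Tuple[str, int, int, str]


def span_f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) under exact span matching."""
    # A prediction is a true positive only if the identical span is in the gold set.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Balanced F-score: the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy annotations invented for illustration only (not taken from CoNECo).
gold = {
    ("doc1", 0, 10, "Complex"),
    ("doc1", 25, 40, "Complex"),
    ("doc2", 5, 18, "Complex"),
}
predicted = {
    ("doc1", 0, 10, "Complex"),
    ("doc2", 5, 18, "Complex"),
    ("doc2", 30, 37, "Complex"),
}

precision, recall, f_score = span_f_score(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

Using sets keeps the comparison order-independent and counts each span once. A relaxed (overlap-based) criterion would need a different matching step.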

