The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop.

IF 3.4 4区 生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Database: The Journal of Biological Databases and Curation Pub Date : 2024-08-09 DOI:10.1093/database/baae071
Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Ling Luo, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Zhiyong Lu
{"title":"The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop.","authors":"Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Ling Luo, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Zhiyong Lu","doi":"10.1093/database/baae071","DOIUrl":null,"url":null,"abstract":"<p><p>The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11315767/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database: The Journal of Biological Databases and Curation","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/database/baae071","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
BioCreative VIII 挑战赛和研讨会上的 BioRED 轨道生物医学关系语料库。
自动识别生物医学关系是对已发表文献的非结构化文本中所含信息进行语义理解的重要一步。第八届生物创想大会的 BioRED 赛道旨在通过向与会者提供 BioRED-BC8 语料库来促进此类方法的发展。BioRED-BC8 语料库由 1000 篇 PubMed 文档组成,这些文档经过人工编辑,包括疾病、基因/蛋白、化学物质、细胞系、基因变异和物种,以及它们之间的配对关系,即疾病-基因、化学物质-基因、疾病-变异、基因-基因、化学物质-疾病、化学物质-化学物质、化学物质-变异和变异-变异。此外,关系还分为以下语义类别:正相关、负相关、结合、转换、药物相互作用、比较、共处理和关联。与之前大多数公开可用的语料库不同,所有关系都是在文档级别而非句子级别上表达的,因此,实体被规范化为标准化词汇表中相应的概念标识符,即疾病和化学物质被规范化为 MeSH,基因(和蛋白质)被规范化为美国国家生物技术信息中心(NCBI)基因,物种被规范化为 NCBI 分类,细胞系被规范化为 Cellosaurus,基因/蛋白质变体被规范化为单核苷酸多态性数据库。最后,每种注释关系都被归类为 "新颖",这取决于它在出版物中是新发现还是实验验证。这种区分有助于将新发现与同一文本中提供已知事实和/或背景知识的其他关系区分开来。BioRED-BC8 语料库使用了之前由 600 篇 PubMed 文章组成的 BioRED 语料库作为训练数据集,并包括一组新发表的 400 篇文章作为挑战赛的测试数据。所有测试文章都是由美国国家医学图书馆的专家生物学家为 BioCreative VIII 挑战赛进行人工标注的,标注过程采用原始标注指南,每篇文章都要经过三轮标注过程进行双重标注,直到所有馆员达成完全一致为止。本手稿详细介绍了 BioRED-BC8 语料库作为生物医学命名实体识别和关系提取的重要资源的特点。利用这一新资源,我们展示了生物医学文本挖掘算法开发的进展。数据库网址:https://codalab.lisn.upsaclay.fr/competitions/16381.
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Database: The Journal of Biological Databases and Curation
Database: The Journal of Biological Databases and Curation MATHEMATICAL & COMPUTATIONAL BIOLOGY-
CiteScore
9.00
自引率
3.40%
发文量
100
审稿时长
>12 weeks
期刊介绍: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.
期刊最新文献
BuffExDb: web-based tissue-specific gene expression resource for breeding and conservation programmes in Bubalus bubalis. Standardized pipelines support and facilitate integration of diverse datasets at the Rat Genome Database. A change language for ontologies and knowledge graphs. Correction to: The landscape of microRNA interaction annotation: analysis of three rare disorders as a case study. LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1