LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations.

IF 3.4 4区 生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Database: The Journal of Biological Databases and Curation Pub Date : 2025-01-13 DOI:10.1093/database/baae129
Esmaeil Nourani, Evangelia-Mantelena Makri, Xiqing Mao, Sampo Pyysalo, Søren Brunak, Katerina Nastou, Lars Juhl Jensen
{"title":"LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations.","authors":"Esmaeil Nourani, Evangelia-Mantelena Makri, Xiqing Mao, Sampo Pyysalo, Søren Brunak, Katerina Nastou, Lars Juhl Jensen","doi":"10.1093/database/baae129","DOIUrl":null,"url":null,"abstract":"<p><p>Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract and classify these relations into specific relation types. However, no comprehensive LSF-disease RE system existed, nor a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 diseases and 6930 LSF entities. We evaluated LSD600's quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications. Database URL: https://zenodo.org/records/13952449.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database: The Journal of Biological Databases and Curation","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/database/baae129","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract and classify these relations into specific relation types. However, no comprehensive LSF-disease RE system existed, nor a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 diseases and 6930 LSF entities. We evaluated LSD600's quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications. Database URL: https://zenodo.org/records/13952449.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
LSD600:第一个带有生活方式与疾病关系注释的生物医学摘要语料库。
人们越来越认识到生活方式因素在疾病的发展和控制中起着重要作用。尽管它们很重要,但缺乏从文献中提取lsf与疾病之间关系的方法,这是将当前可用知识整合为结构化形式的必要步骤。由于简单的基于共现的关系提取(RE)方法无法区分不同类型的lsf疾病关系,因此需要诸如transformer之类的上下文感知模型来提取这些关系并将其分类为特定的关系类型。然而,目前还没有一个全面的lsf疾病RE系统,也没有一个适合开发的语料库。我们提出了LSD600(可在https://zenodo.org/records/13952449上获得),这是第一个专门为LSF疾病RE设计的语料库,包含600篇摘要,其中包含5027种疾病和6930种LSF实体之间8种不同类型的1900种关系。我们通过在语料库上训练RoBERTa模型来评估LSD600的质量,在hold - hold测试集中,多标签RE任务的f值达到68.5%。我们通过在两个Nutrition-Disease和FoodDisease数据集上使用训练好的模型进一步验证了LSD600,其f得分分别达到70.7%和80.7%。在这些性能结果的基础上,LSD600和在其上训练的RE系统可以成为填补该领域现有空白的宝贵资源,并为下游应用铺平道路。数据库地址:https://zenodo.org/records/13952449。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Database: The Journal of Biological Databases and Curation
Database: The Journal of Biological Databases and Curation MATHEMATICAL & COMPUTATIONAL BIOLOGY-
CiteScore
9.00
自引率
3.40%
发文量
100
审稿时长
>12 weeks
期刊介绍: Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.
期刊最新文献
Standardized pipelines support and facilitate integration of diverse datasets at the Rat Genome Database. A change language for ontologies and knowledge graphs. Correction to: The landscape of microRNA interaction annotation: analysis of three rare disorders as a case study. LSD600: the first corpus of biomedical abstracts annotated with lifestyle-disease relations. DisGeNet: a disease-centric interaction database among diseases and various associated genes.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1