RDscan: Extracting RNA-disease relationship from the literature based on pre-training model

IF 4.2; Zone 3, Biology; JCR Q1, BIOCHEMICAL RESEARCH METHODS; Journal: Methods; Pub Date: 2024-05-22; DOI: 10.1016/j.ymeth.2024.05.012
Yang Zhang, Yu Yang, Liping Ren, Lin Ning, Quan Zou, Nanchao Luo, Yinghui Zhang, Ruijun Liu
Methods, Volume 228, Pages 48-54. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S1046202324001312
Citations: 0

Abstract

With the rapid advancements in molecular biology and genomics, a multitude of connections between RNA and diseases have been unveiled, making the efficient and accurate extraction of RNA-disease (RD) relationships from the extensive biomedical literature crucial for advancing research in this field. This study introduces RDscan, a novel text mining method based on the pre-training and fine-tuning strategy, which aims to automatically extract RD-related information from a vast corpus of literature using pre-trained biomedical large language models (LLMs). First, we constructed a dedicated RD corpus by manually curating the literature, comprising 2,082 positive and 2,000 negative sentences, alongside an independent test dataset (500 positive and 500 negative sentences), for training and evaluating RDscan. Then, by fine-tuning the Bioformer and BioBERT pre-trained models, RDscan demonstrated exceptional performance in text classification and named entity recognition (NER) tasks. In 5-fold cross-validation, RDscan significantly outperformed traditional machine learning methods (Support Vector Machine, Logistic Regression and Random Forest). In addition, we have developed an accessible webserver that assists users in extracting RD relationships from text. In summary, RDscan is the first text mining tool specifically designed for RD relationship extraction, and is poised to become an invaluable tool for researchers exploring the intricate interactions between RNA and diseases. The RDscan webserver is freely available at https://cellknowledge.com.cn/RDscan/.
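The 5-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the folds were formed, so the stratified round-robin split, the fixed seed, and the function name `stratified_kfold` below are assumptions made for illustration only; the class counts (2,082 positive, 2,000 negative) are the only values taken from the text. No model is trained here; the code only shows how the corpus could be partitioned.

```python
import random
from typing import Iterator, List, Tuple


def stratified_kfold(
    labels: List[int], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper: each class is shuffled and dealt round-robin
    across the folds, so every fold keeps roughly the same class ratio.
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for held_out in range(k):
        val = sorted(folds[held_out])
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, val


# Corpus sizes from the abstract: 2,082 positive and 2,000 negative sentences.
labels = [1] * 2082 + [0] * 2000
splits = list(stratified_kfold(labels, k=5))
for fold_no, (train, val) in enumerate(splits, start=1):
    n_pos = sum(labels[i] for i in val)
    print(f"fold {fold_no}: train={len(train)} val={len(val)} val_pos={n_pos}")
```

Under this scheme every sentence is held out exactly once, and each validation fold contains 400 negative and 416 or 417 positive sentences. The independent test set of 500 positive and 500 negative sentences described in the abstract would stay outside these folds entirely.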

Source journal: Methods (Biology, Biochemical Research Methods)
CiteScore: 9.80
Self-citation rate: 2.10%
Articles published: 222
Review time: 11.3 weeks
Journal description: Methods focuses on rapidly developing techniques in the experimental biological and medical sciences. Each topical issue, organized by a guest editor who is an expert in the area covered, consists solely of invited quality articles by specialist authors, many of them reviews. Issues are devoted to specific technical approaches with emphasis on clear detailed descriptions of protocols that allow them to be reproduced easily. The background information provided enables researchers to understand the principles underlying the methods; other helpful sections include comparisons of alternative methods giving the advantages and disadvantages of particular methods, guidance on avoiding potential pitfalls, and suggestions for troubleshooting.