Literature mining discerns latent disease-gene relationships.

IF 4.4 | Zone 3, Biology | JCR Q1, Biochemical Research Methods | Bioinformatics | Pub Date: 2024-04-12 | DOI: 10.1093/bioinformatics/btae185
Priyadarshini Rai, Atishay Jain, Shivani Kumar, Divya Sharma, Neha Jha, Smriti Chawla, Abhijith S. Raj, Apoorva Gupta, Sarita Poonia, A. Majumdar, Tanmoy Chakraborty, Gaurav Ahuja, Debarka Sengupta

Abstract

MOTIVATION
Dysregulation of a gene's function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily due to genetic pleiotropy and lack of suitable computational approaches. With the advent of high throughput genomics platforms and community scale initiatives such as the Human Cell Landscape (HCL) project (Han et al., 2020), researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our finger-tip when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates.

RESULTS
To circumvent this, we mined ∼18 million PubMed abstracts published till May 2019 and automatically selected ∼4.5 million of them that describe roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained Bidirectional Encoder Representations from Transformers (BERT) for language modeling from the domain of Natural Language Processing (NLP) to learn vector representation of entities such as genes, diseases, tissues, cell-types etc., in a way such that their relationship is preserved in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in-silico synthesis of hypotheses linking different biological entities such as genes and conditions.

AVAILABILITY AND IMPLEMENTATION
PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model
BioSentVec based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model
Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap

SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
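
The idea in the Results paragraph, that entity vectors preserve relationships so that uncited disease-gene links can be proposed, can be illustrated with a minimal sketch. It assumes cosine similarity as the relatedness measure, which the abstract does not specify. The entity names and 3-dimensional vectors are invented toy values, not PathoBERT outputs, and `rank_genes` and `known_pairs` are hypothetical names introduced only for this example.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# They are NOT outputs of PathoBERT; real embeddings would be much larger.
EMBEDDINGS = {
    "disease:toy_disease": [0.9, 0.1, 0.0],
    "gene:GENE_A": [0.8, 0.2, 0.1],
    "gene:GENE_B": [0.0, 1.0, 0.0],
    "gene:GENE_C": [0.5, 0.5, 0.5],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_genes(disease_key, embeddings, known_pairs=frozenset()):
    """Rank gene entities by similarity to a disease vector.

    Pairs listed in known_pairs (assumed to be already reported) are skipped,
    so the remaining ranking contains only candidate, not-yet-cited links.
    """
    disease_vec = embeddings[disease_key]
    scored = []
    for key, vec in embeddings.items():
        if not key.startswith("gene:"):
            continue  # only genes are candidates
        if (disease_key, key) in known_pairs:
            continue  # drop associations that are already known
        scored.append((key, cosine_similarity(disease_vec, vec)))
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    known = {("disease:toy_disease", "gene:GENE_A")}
    for gene, score in rank_genes("disease:toy_disease", EMBEDDINGS, known):
        print(f"{gene}\t{score:.3f}")
```

Filtering out known pairs before ranking mirrors the abstract's emphasis on associations not cited in the training data. Any candidate produced this way would be a hypothesis to check, not a confirmed finding.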
Source journal: Bioinformatics (Biology: Biochemical Research Methods)
CiteScore: 11.20
Self-citation rate: 5.20%
Articles published: 753
Review time: 2.1 months
Journal description: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers: the former reports biologically interesting discoveries using computational methods, and the latter explores the applications used for experiments.