Literature mining discerns latent disease-gene relationships
Authors: Priyadarshini Rai, Atishay Jain, Shivani Kumar, Divya Sharma, Neha Jha, Smriti Chawla, Abhijith S. Raj, Apoorva Gupta, Sarita Poonia, A. Majumdar, Tanmoy Chakraborty, Gaurav Ahuja, Debarka Sengupta
Journal: Bioinformatics
Published: 2024-04-12
DOI: 10.1093/bioinformatics/btae185 (https://doi.org/10.1093/bioinformatics/btae185)
Abstract
MOTIVATION
Dysregulation of a gene's function, due either to mutations or to impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these gene-pathology relationships remains a daunting task, primarily because of genetic pleiotropy and the lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape (HCL) project (Han et al., 2020), researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is not currently at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates.
RESULTS
To circumvent this, we mined ∼18 million PubMed abstracts published up to May 2019 and automatically selected the ∼4.5 million of them that describe the roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained Bidirectional Encoder Representations from Transformers (BERT) model, from the domain of Natural Language Processing (NLP), for language modeling to learn vector representations of entities such as genes, diseases, tissues and cell types, such that their relationships are preserved in the vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in silico synthesis of hypotheses linking different biological entities such as genes and conditions.
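The idea of preserving relationships in a vector space can be illustrated with a minimal sketch: if genes and diseases are embedded as vectors, candidate genes for a disease can be ranked by cosine similarity to the disease vector. This is not the authors' implementation or the PathoBERT API. The function names `cosine_similarity` and `rank_genes`, the entity labels, and the three-dimensional vectors are all hypothetical placeholders chosen for illustration, and cosine similarity is an assumed scoring choice that the abstract does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_genes(disease_vec, gene_vecs):
    """Return (gene, score) pairs sorted from most to least similar to the disease vector."""
    scored = [
        (gene, cosine_similarity(disease_vec, vec))
        for gene, vec in gene_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings: placeholder names and made-up numbers,
# not output from any trained model.
gene_vecs = {
    "gene_A": [0.9, 0.1, 0.0],
    "gene_B": [0.1, 0.8, 0.2],
    "gene_C": [-0.5, 0.2, 0.7],
}
disease_vec = [1.0, 0.2, 0.0]

ranking = rank_genes(disease_vec, gene_vecs)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

In practice the vectors would come from a trained model and have hundreds of dimensions; high-scoring pairs absent from the training corpus would then be treated as hypotheses to check, not as established associations.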
AVAILABILITY AND IMPLEMENTATION
PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model
BioSentVec based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model
Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
About the journal:
The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers; the former reports biologically interesting discoveries made using computational methods, and the latter explores the applications used for experiments.