Literature mining discerns latent disease-gene relationships.

IF 5.5 Zone 3 Materials Science Q2 CHEMISTRY, PHYSICAL ACS Applied Energy Materials Pub Date: 2024-04-12 DOI: 10.1093/bioinformatics/btae185

Priyadarshini Rai, Atishay Jain, Shivani Kumar, Divya Sharma, Neha Jha, Smriti Chawla, Abhijith S. Raj, Apoorva Gupta, Sarita Poonia, A. Majumdar, Tanmoy Chakraborty, Gaurav Ahuja, Debarka Sengupta



Abstract



MOTIVATION
Dysregulation of a gene's function, due either to mutations or to impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily because of genetic pleiotropy and a lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape (HCL) project (Han et al., 2020), researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is not currently at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates.

RESULTS
To circumvent this, we mined ∼18 million PubMed abstracts published until May 2019 and automatically selected ∼4.5 million of them that describe the roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained Bidirectional Encoder Representations from Transformers (BERT) language model from the domain of Natural Language Processing (NLP) to learn vector representations of entities such as genes, diseases, tissues, and cell types, such that their relationships are preserved in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in-silico synthesis of hypotheses linking different biological entities such as genes and conditions.

AVAILABILITY AND IMPLEMENTATION
PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model
BioSentVec based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model
Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap

SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
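The abstract says that entity embeddings are learned so that relationships between genes and diseases are preserved in a vector space. As a rough illustration of how such a space might be queried, the sketch below ranks candidate genes for a disease by cosine similarity. This is an assumption made for illustration: the abstract does not state which similarity measure or ranking procedure the authors use, the vectors are hand-made toy values rather than PathoBERT outputs, and the names `cosine_similarity`, `rank_genes`, `GENE_A`, `GENE_B`, and `GENE_C` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_genes(disease_vec, gene_vecs):
    """Return (gene, similarity) pairs sorted from most to least similar to the disease vector."""
    scored = [(gene, cosine_similarity(disease_vec, vec))
              for gene, vec in gene_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, hand-made vectors for illustration only; these are NOT real embeddings.
toy_disease = [1.0, 0.0, 0.0]
toy_genes = {
    "GENE_A": [0.9, 0.1, 0.0],  # hypothetical gene close to the disease vector
    "GENE_B": [0.0, 1.0, 0.0],  # hypothetical gene orthogonal to it
    "GENE_C": [0.5, 0.5, 0.0],  # hypothetical gene in between
}

ranking = rank_genes(toy_disease, toy_genes)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

In a real setting the vectors would come from the fine-tuned model, and a high-ranking pair absent from the training text would be treated only as a hypothesis to check, not as an established association.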
