评估用于疾病基因发现的网络引导随机森林

IF 6.1 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Biodata Mining Pub Date : 2024-04-16 DOI:10.1186/s13040-024-00361-5

Jianchang Hu, Silke Szymczak

{"title":"评估用于疾病基因发现的网络引导随机森林","authors":"Jianchang Hu, Silke Szymczak","doi":"10.1186/s13040-024-00361-5","DOIUrl":null,"url":null,"abstract":"Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF where the network information is summarized into a sampling probability of predictor variables which is further used in the construction of the RF. Our simulation results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent from genes in the given network, spurious gene selection results can occur when using network information, especially on hub genes. Our empirical analysis on two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes. Gene networks can provide additional information to aid the gene expression analysis for disease module and pathway identification. But they need to be used with caution and validation on the results need to be carried out to guard against spurious gene selection. More robust approaches to incorporate such information into RF construction also warrant further study.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"55 1","pages":""},"PeriodicalIF":6.1000,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluation of network-guided random forest for disease gene discovery\",\"authors\":\"Jianchang Hu, Silke Szymczak\",\"doi\":\"10.1186/s13040-024-00361-5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF where the network information is summarized into a sampling probability of predictor variables which is further used in the construction of the RF. Our simulation results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent from genes in the given network, spurious gene selection results can occur when using network information, especially on hub genes. Our empirical analysis on two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes. Gene networks can provide additional information to aid the gene expression analysis for disease module and pathway identification. But they need to be used with caution and validation on the results need to be carried out to guard against spurious gene selection. More robust approaches to incorporate such information into RF construction also warrant further study.\",\"PeriodicalId\":48947,\"journal\":{\"name\":\"Biodata Mining\",\"volume\":\"55 1\",\"pages\":\"\"},\"PeriodicalIF\":6.1000,\"publicationDate\":\"2024-04-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Biodata Mining\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s13040-024-00361-5\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-024-00361-5","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

基因网络信息被认为有利于疾病模块和通路的识别，但在用于基因表达数据分析的标准随机森林（RF）算法中尚未得到明确利用。我们研究了网络引导的 RF 的性能，在这种 RF 中，网络信息被归纳为预测变量的抽样概率，并进一步用于构建 RF。我们的模拟结果表明，与标准 RF 相比，网络引导 RF 并不能提供更好的疾病预测。在疾病基因发现方面，如果疾病基因形成模块，网络引导 RF 能更准确地识别它们。此外，当疾病状态与给定网络中的基因无关时，使用网络信息可能会出现虚假的基因选择结果，尤其是在枢纽基因上。我们对来自癌症基因组图谱（TCGA）的两个平衡微阵列和 RNA-Seq 乳腺癌数据集进行了实证分析，以对孕酮受体（PR）状态进行分类，结果也表明网络引导的 RF 可以识别 PGR 相关通路中的基因，从而产生连接性更好的已识别基因模块。基因网络可以为疾病模块和通路识别的基因表达分析提供额外的辅助信息。但需要谨慎使用，并对结果进行验证，以防止虚假的基因选择。将此类信息纳入 RF 构建的更稳健方法也值得进一步研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Evaluation of network-guided random forest for disease gene discovery

Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF where the network information is summarized into a sampling probability of predictor variables which is further used in the construction of the RF. Our simulation results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent from genes in the given network, spurious gene selection results can occur when using network information, especially on hub genes. Our empirical analysis on two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes. Gene networks can provide additional information to aid the gene expression analysis for disease module and pathway identification. But they need to be used with caution and validation on the results need to be carried out to guard against spurious gene selection. More robust approaches to incorporate such information into RF construction also warrant further study.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.