Gene Selection with Sequential Classification and Regression Tree Algorithm.

Biostatistics, Bioinformatics and Biomathematics, vol. 2(4), pp. 157-186. Pub Date: 2011-08-01

Caleb D Bastian, Grzegorz A Rempala



Abstract

Background: Gene-selection problems typically involve high-dimensional data, e.g., gene expression data from microarray or next-generation sequencing technologies, which generate an enormous volume of high-throughput measurements. There is often a need for a simple, computationally inexpensive, non-parametric screening procedure that can quickly and accurately find a low-dimensional variable subset preserving the biological information in the original, very high-dimensional data (dimension p > 40,000). This is in contrast to highly sophisticated variable selection methods, which are computationally expensive, need pre-processing routines, and often require calibration of priors.

Results: We present a tree-based sequential CART (S-CART) approach to variable selection in the binary classification setting and compare it against the more sophisticated procedures using simulated and real biological data. In simulated data, we analyze S-CART performance versus (i) a random forest (RF), (ii) a fully-parametric Bayesian stochastic search variable selection (SSVS), and (iii) the moderated t-test statistic from the LIMMA package in R. The simulation study is based on a hierarchical Bayesian model, where dataset dimensionality, percentage of significant variables, and substructure via dependency vary. Selection efficacy is measured through false-discovery and missed-discovery rates. In all scenarios, the S-CART method is seen to consistently outperform SSVS and RF in both speed and detection accuracy. We demonstrate the utility of the S-CART technique both on simulated data and in a control-treatment mouse study. We show that the network analysis based on the S-CART-selected gene subset in essence recapitulates the biological findings of the study using only a fraction of the original set of genes considered in the study's analysis.
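As an illustration of the two error measures named above, the following minimal Python sketch computes a false-discovery rate and a missed-discovery rate for one selected variable subset. The definitions used here are assumptions, since the abstract does not state them: the false-discovery rate is taken as the fraction of selected variables that are not truly significant, and the missed-discovery rate as the fraction of truly significant variables that were not selected. The function name `selection_error_rates` is hypothetical.

```python
def selection_error_rates(selected, truly_significant):
    """Return (false_discovery_rate, missed_discovery_rate) for one selection.

    Assumed definitions (not taken from the paper):
      false-discovery rate  = |selected but not significant| / |selected|
      missed-discovery rate = |significant but not selected| / |significant|
    """
    selected = set(selected)
    truth = set(truly_significant)
    false_discoveries = len(selected - truth)
    missed_discoveries = len(truth - selected)
    # Guard against empty sets so the rates are always defined.
    fdr = false_discoveries / len(selected) if selected else 0.0
    mdr = missed_discoveries / len(truth) if truth else 0.0
    return fdr, mdr


# Example: variables 2, 3, 5 are truly significant; a method selects 1, 2, 3, 4.
fdr, mdr = selection_error_rates([1, 2, 3, 4], [2, 3, 5])
print(f"FDR = {fdr:.3f}, MDR = {mdr:.3f}")
```

In a simulation study the set of truly significant variables is known by construction, so both rates can be averaged over repeated simulated datasets.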

Conclusions: Relatively simple gene selection algorithms like S-CART may often be preferable in practice to much more sophisticated ones. The advantage of the "greedy" selection methods used by S-CART and similar approaches is that they scale well with problem size and require virtually no tuning or training, while remaining efficient at extracting the relevant information from microarray-like datasets containing a large number of redundant or irrelevant variables.
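To make the idea of a greedy, tuning-free tree-based screen concrete, the sketch below ranks variables by the best Gini impurity decrease achievable with a single CART-style split on each variable, for binary class labels. This is not the S-CART algorithm itself, whose details are not given in the abstract; it is a simplified stand-in under stated assumptions. The names `best_split_gain` and `rank_by_stump_gain`, and the parameters `max_vars` and `min_gain`, are hypothetical.

```python
def gini(n_pos, n):
    """Gini impurity of a node with n samples, n_pos of them in class 1."""
    if n == 0:
        return 0.0
    p = n_pos / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease over all thresholds on one variable."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(labels)
    parent = gini(total_pos, n)
    best, left_pos = 0.0, 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal values
        right_pos = total_pos - left_pos
        child = (i * gini(left_pos, i) + (n - i) * gini(right_pos, n - i)) / n
        best = max(best, parent - child)
    return best


def rank_by_stump_gain(X, y, max_vars, min_gain=0.0):
    """Return up to max_vars column indices, best single-split gain first.

    X is a list of rows (samples); y is a list of 0/1 class labels.
    Columns whose gain does not exceed min_gain are dropped.
    """
    n_cols = len(X[0])
    gains = [best_split_gain([row[j] for row in X], y) for j in range(n_cols)]
    order = sorted(range(n_cols), key=lambda j: (-gains[j], j))
    return [j for j in order if gains[j] > min_gain][:max_vars]


# Toy data: column 1 separates the classes perfectly, column 2 partially.
X = [[1, 0.1, 0.1], [2, 0.2, 0.2], [1, 0.3, 0.8],
     [2, 0.7, 0.3], [1, 0.8, 0.9], [2, 0.9, 0.7]]
y = [0, 0, 0, 1, 1, 1]
print(rank_by_stump_gain(X, y, max_vars=2))
```

Each variable is scored independently after one sort, so the cost grows roughly as p · n log n for p variables and n samples, which is the kind of scaling that makes greedy screens attractive when p is in the tens of thousands.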

Availability: The MATLAB 7.4b code for the S-CART implementation is available for download from https://neyman.mcg.edu/posts/scart.zip.


