Title: Model-based feature selection and clustering of RNA-seq data for unsupervised subtype discovery
Authors: David K Lim, Naim U Rashid, Joseph G Ibrahim
Journal: Annals of Applied Statistics, vol. 15, no. 1, pp. 481-508 (JCR Q2, Statistics & Probability; impact factor 1.3)
DOI: 10.1214/20-aoas1407 (https://doi.org/10.1214/20-aoas1407)
Published: 2021-03-01 (Epub 2021-03-18)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8386505/pdf/nihms-1716637.pdf
Source: Semantic Scholar
Citations: 0
MODEL-BASED FEATURE SELECTION AND CLUSTERING OF RNA-SEQ DATA FOR UNSUPERVISED SUBTYPE DISCOVERY

Abstract
Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application in biomedical research is delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes are informative in discriminating between clusters, and what the optimal number of clusters is. Few methods exist for performing unsupervised clustering of RNA-seq samples, and none currently adjust for between-sample global normalization factors, select cluster-discriminatory genes, or account for potential confounding variables during clustering. To address these issues, we propose Feature Selection and Clustering of RNA-seq (FSCseq): a model-based clustering algorithm that uses a finite mixture of regression (FMR) model and the quadratic penalty method with a Smoothly-Clipped Absolute Deviation (SCAD) penalty. The maximization is done by a penalized Classification EM algorithm, which allows normalization factors and confounders to be included in the modeling framework. Given the fitted model, the framework allows for subtype prediction in new patients via posterior probabilities of cluster membership, even in the presence of batch effects. Based on simulations and real data analysis, we show the advantages of our method relative to competing approaches.
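Two ingredients named in the abstract, the SCAD penalty and posterior probabilities of cluster membership, can be illustrated with a minimal sketch. This is not the FSCseq implementation. The function names (`scad_penalty`, `posterior_membership`, `classify`) are hypothetical, the default `a = 3.7` is the conventional choice from the SCAD literature rather than a value stated in the abstract, and the Poisson count model is an assumption made purely for illustration, since the abstract does not state the count distribution.

```python
import math


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient (standard piecewise form, a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: behaves like the lasso (linear in |theta|).
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic blend that tapers the penalty.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2


def poisson_logpmf(y, mu):
    """Log of the Poisson probability mass function (illustrative count model)."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def posterior_membership(counts, size_factor, log_means, mix_props):
    """Posterior probability that one sample belongs to each cluster.

    counts:      gene counts for the sample
    size_factor: the sample's normalization factor (assumed > 0)
    log_means:   log_means[k][j] is the log mean of gene j in cluster k
    mix_props:   mixing proportions, one per cluster
    """
    log_w = []
    for k, pi_k in enumerate(mix_props):
        ll = math.log(pi_k)
        for j, y in enumerate(counts):
            # The normalization factor scales the cluster-specific mean.
            ll += poisson_logpmf(y, size_factor * math.exp(log_means[k][j]))
        log_w.append(ll)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def classify(posteriors):
    """Hard assignment to the most probable cluster (the classification step)."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)


if __name__ == "__main__":
    post = posterior_membership(
        counts=[10, 0],
        size_factor=1.0,
        log_means=[[math.log(10), 0.0], [0.0, math.log(10)]],
        mix_props=[0.5, 0.5],
    )
    print(post, classify(post))
```

In this sketch a new sample is assigned to a subtype by computing its posterior probabilities from already-fitted parameters and taking the most probable cluster. A fuller model in the spirit of the abstract would also include confounder terms in the regression for each gene's mean.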
About the journal:
Statistical research spans an enormous range from direct subject-matter collaborations to pure mathematical theory. The Annals of Applied Statistics, the newest journal from the IMS, is aimed at papers in the applied half of this range. Published quarterly in both print and electronic form, the journal aims to provide a timely and unified forum for all areas of applied statistics.