Knowledge-Guided Biclustering via Sparse Variational EM Algorithm
Authors: Changgee Chang, Jihwan Oh, Eun Jeong Min, Qi Long
Venue: 10th IEEE International Conference on Big Knowledge, 10-11 November 2019, Beijing, China, vol. 2019, pp. 25-32
Published: 2019-11-01 (Epub 2019-12-30)
DOI: https://doi.org/10.1109/icbk.2019.00012
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8291726/pdf/nihms-1588833.pdf
Abstract
In the analysis of a gene expression data matrix, for example, a biclustering is defined as a set of biclusters, where each bicluster pairs a group of genes with a group of samples for which those genes are differentially expressed. Although many data mining approaches for biclustering exist in the literature, only a few can incorporate prior knowledge into the analysis, which can greatly improve accuracy and interpretability, and all of them are limited in handling discrete data types. We propose a generalized biclustering approach that can be used for integrative analysis of multi-omics data with different data types. Our method can utilize biological information that can be represented by a graph, such as functional genomics and functional proteomics, and can accommodate a combination of continuous and discrete data types. The proposed method builds on a generalized Bayesian factor analysis framework, and a variational EM approach is used to obtain parameter estimates, where the latent quantities in the log-likelihood are iteratively imputed by their conditional expectations. The biclusters are retrieved via the sparse estimates of the factor loadings and the conditional expectation of the latent factors. To obtain a sparse conditional expectation of the latent factors, a novel sparse variational EM algorithm is used. We demonstrate the superiority of our method over several existing biclustering methods in extensive simulation experiments and in an integrative analysis of multi-omics data.
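As a rough illustration of the retrieval step described in the abstract, the sketch below reads biclusters off the supports of a sparse loading matrix and a sparse matrix of latent-factor expectations. It is not the authors' implementation: the names `extract_biclusters`, `W`, `Z`, and `tol`, the matrix orientations, and the rule of one bicluster per factor are all assumptions made for this example, and the toy matrices are invented rather than estimated by any EM procedure.

```python
import numpy as np


def extract_biclusters(W, Z, tol=1e-8):
    """Read biclusters off sparse factor-analysis estimates.

    Assumed (hypothetical) conventions, not taken from the paper:
      W : (p, K) array of sparse factor loadings, features x factors.
      Z : (K, n) array of sparse conditional expectations of the
          latent factors, factors x samples.
    Factor k yields the bicluster (features with nonzero W[:, k],
    samples with nonzero Z[k, :]).
    """
    W = np.asarray(W, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if W.shape[1] != Z.shape[0]:
        raise ValueError("W and Z must share the factor dimension K")

    biclusters = []
    for k in range(W.shape[1]):
        features = np.flatnonzero(np.abs(W[:, k]) > tol)
        samples = np.flatnonzero(np.abs(Z[k, :]) > tol)
        # Skip factors whose support is empty on either side.
        if features.size and samples.size:
            biclusters.append((features, samples))
    return biclusters


# Toy input: 4 features, 3 samples, 2 factors (values are made up).
W = np.array([[1.2, 0.0],
              [0.8, 0.0],
              [0.0, -0.5],
              [0.0, 0.0]])
Z = np.array([[0.9, 0.0, 1.1],
              [0.0, 2.0, 0.0]])

result = extract_biclusters(W, Z)
for features, samples in result:
    print("features:", features.tolist(), "samples:", samples.tolist())
```

Sparsity in both matrices is what makes this step trivial: a dense posterior mean for the latent factors would assign every sample to every bicluster, which is why the abstract stresses obtaining a sparse conditional expectation.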