
Statistical Applications in Genetics and Molecular Biology: latest articles

Bayesian approach to discriminant problems for count data with application to multilocus short tandem repeat dataset.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2020-05-04 DOI: 10.1515/sagmb-2018-0044
Koji Tsukuda, Shuhei Mano, Toshimichi Yamamoto

Short Tandem Repeats (STRs) are a type of DNA polymorphism. This study considers discriminant analysis to determine the population of test individuals using an STR database containing the lengths of STRs observed at more than one locus. The discriminant method based on the Bayes factor is discussed and an improved method is proposed. The main issues are to develop a method that is relatively robust to sample size imbalance, identify a procedure to select loci, and treat the parameter in the prior distribution. A previous study achieved a classification accuracy of 0.748 for the g-mean (geometric mean of classification accuracies for two populations) and 0.867 for the AUC (area under the receiver operating characteristic curve). We improve the maximum values for the g-mean to 0.830 and the AUC to 0.935. Computer simulations indicate that the previous method is susceptible to sample size imbalance, whereas the proposed method is more robust while achieving almost identical classification accuracy. Furthermore, the results confirm that threshold adjustment is an effective countermeasure to sample size imbalance.
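
As a rough illustration of the two evaluation ideas in the abstract above, the sketch below computes the g-mean as the geometric mean of the per-population accuracies and shows how moving the decision threshold on a log Bayes factor can help when one population is much smaller than the other. The function names, the toy labels, and the log Bayes factor values are all hypothetical; this is not the authors' implementation.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of the classification accuracies of two populations (labels 0 and 1)."""
    accuracies = []
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        accuracies.append(correct / len(idx))
    return math.sqrt(accuracies[0] * accuracies[1])


def classify(log_bayes_factors, threshold=0.0):
    """Assign population 1 when the log Bayes factor exceeds the threshold, else population 0."""
    return [1 if lbf > threshold else 0 for lbf in log_bayes_factors]


# Hypothetical, imbalanced toy data: 2 individuals from population 1, 6 from population 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
log_bf = [0.4, -0.3, -1.2, -0.9, -2.0, -0.6, -1.5, -0.8]

print(g_mean(y_true, classify(log_bf, threshold=0.0)))   # about 0.707
print(g_mean(y_true, classify(log_bf, threshold=-0.5)))  # 1.0
```

In this toy case the default threshold misclassifies one of the two minority-population individuals, and lowering the threshold recovers it without costing accuracy on the majority population. Real data would rarely separate this cleanly.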

Citations: 0
Identification of supervised and sparse functional genomic pathways.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2020-02-29 DOI: 10.1515/sagmb-2018-0026
Fan Zhang, Jeffrey C Miecznikowski, David L Tritchler

Functional pathways involve a series of biological alterations that may result in the occurrence of many diseases, including cancer. With the availability of various "omics" technologies, it becomes feasible to integrate information from a hierarchy of biological layers to provide a more comprehensive understanding of the disease. In many diseases, it is believed that only a small number of networks, each relatively small in size, drive the disease. Our goal in this study is to develop methods to discover these functional networks across biological layers correlated with the phenotype. We derive a novel Network Summary Matrix (NSM) that highlights potential pathways conforming to least squares regression relationships. We propose an algorithm called Decomposition of Network Summary Matrix via Instability (DNSMI), which decomposes the NSM using instability regularization. Simulations and real data analysis from The Cancer Genome Atlas (TCGA) program demonstrate the performance of the algorithm.

Citations: 4
Joint variable selection and network modeling for detecting eQTLs.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2020-02-20 DOI: 10.1515/sagmb-2019-0032
Xuan Cao, Lili Ding, Tesfaye B Mersha

In this study, we compare the three most recent statistical methods for joint variable selection and covariance estimation, applied to detecting expression quantitative trait loci (eQTL) and estimating gene networks, and introduce a new hierarchical Bayesian method to be included in the comparison. Unlike the traditional univariate regression approach in eQTL analysis, all four methods correlate phenotypes and genotypes through multivariate regression models that incorporate the dependence information among phenotypes, and use Bayesian multiplicity adjustment to avoid the multiple testing burden raised by traditional multiple testing correction methods. We present the performance of the three existing methods (MSSL - Multivariate Spike and Slab Lasso, SSUR - Sparse Seemingly Unrelated Bayesian Regression, and OBFBF - Objective Bayes Fractional Bayes Factor), along with the proposed JDAG (Joint estimation via a Gaussian Directed Acyclic Graph model) method, through simulation experiments and publicly available HapMap real data, taking asthma as an example. Compared with existing methods, JDAG identified networks with higher sensitivity and specificity under row-wise sparse settings. JDAG requires less execution time in small-to-moderate dimensions, but is not currently applicable to high dimensional data. The eQTL analysis of the asthma data showed a number of known gene regulations, such as STARD3, IKZF3 and PGAP3, all reported in asthma studies. The code of the proposed method is freely available at GitHub (https://github.com/xuan-cao/Joint-estimation-for-eQTL).

Citations: 0
An extended model for phylogenetic maximum likelihood based on discrete morphological characters.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2020-02-20 DOI: 10.1515/sagmb-2019-0029
David A Spade

Maximum likelihood is a common method of estimating a phylogenetic tree based on a set of genetic data. However, models of evolution for certain types of genetic data are highly flawed in their specification, and this misspecification can have an adverse impact on phylogenetic inference. Our attention here is focused on extending an existing class of models for estimating phylogenetic trees from discrete morphological characters. The main advance of this work is a model that allows unequal equilibrium frequencies in the estimation of phylogenetic trees from discrete morphological character data using likelihood methods. Possible extensions of the proposed model will also be discussed.

Citations: 1
Sparse latent factor regression models for genome-wide and epigenome-wide association studies
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2020-02-07 DOI: 10.1101/2020.02.07.938381
B. Jumentier, Kévin Caye, B. Heude, J. Lepeule, O. François
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to account for variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. Here, we present least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. In simulated data, sparse latent factor regression models generally achieved higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator and a Bayesian sparse linear mixed model. In generative model simulations, statistical performance was slightly lower than (while being comparable to) that of non-sparse methods, but in simulations based on empirical data, sparse latent factor regression models were more robust to departure from the model than the non-sparse approaches. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while overcoming multiple testing issues. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
Citations: 3
Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-12-12 DOI: 10.1515/sagmb-2018-0065
Oliver M Crook, Laurent Gatto, Paul D W Kirk

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel.

Citations: 6
AdaReg: data adaptive robust estimation in linear regression with application in GTEx gene expressions
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-12-10 DOI: 10.1101/869362
Meng Wang, Lihua Jiang, M. Snyder
The Genotype-Tissue Expression (GTEx) project provides a valuable resource of large-scale gene expression data across multiple tissue types. In the presence of technical noise and unknown or unmeasured factors, robustly estimating the major tissue effect is challenging. Moreover, different genes exhibit heterogeneous expression across tissue types. We therefore need a robust method that adapts to the heterogeneity of gene expression to improve estimation of the tissue effect. We followed the robust estimation approach based on γ-density-power weights from Fujisawa and Eguchi (2008, "Robust parameter estimation with a small bias against heavy contamination", J. Multivariate Anal. 99: 2053–2081) and Windham (1995, "Robustifying model fitting", J. Roy. Stat. Soc. B: 599–609), where γ is the exponent of the density weight and controls the balance between bias and variance. As far as we know, our work is the first to propose a procedure for tuning γ to balance the bias-variance trade-off under mixture models. We constructed a robust likelihood criterion based on weighted densities in a mixture model in which a Gaussian population distribution is mixed with an unknown outlier distribution, and developed a data-adaptive γ-selection procedure embedded in the robust estimation. We provided a heuristic analysis of the selection criterion and found, in simulation studies under a series of settings, that the trend of our practical selection criterion across values of γ, in average performance, captures the minimizing γ about as well as the trend of the inestimable mean squared error (MSE). Our data-adaptive robustifying procedure for linear regression (AdaReg) showed a significant advantage over the fixed-γ procedure and other robust methods, both in simulation studies and in a real data application estimating the tissue effect of heart samples from the GTEx project. Finally, the paper discusses some limitations of the method and future work.
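
To make the role of γ more concrete, the sketch below applies density-power weighting to the simplest possible case: estimating the mean and standard deviation of a univariate Gaussian sample that contains an outlier. Each observation is weighted by the fitted Gaussian density raised to the power γ, so points far from the bulk receive almost no weight. It is a toy with a fixed γ, not the AdaReg procedure (which addresses linear regression and selects γ adaptively); the function name, the starting values, and the data are assumptions made for illustration.

```python
import math


def gamma_weighted_gaussian(x, gamma, n_iter=100):
    """Fixed-point estimate of a Gaussian mean and sd with weights proportional to density**gamma."""
    xs = sorted(x)
    n = len(xs)
    # Robust starting values: median and a MAD-based scale (an illustrative choice).
    mu = xs[n // 2] if n % 2 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])
    devs = sorted(abs(v - mu) for v in x)
    mad = devs[n // 2] if n % 2 else 0.5 * (devs[n // 2 - 1] + devs[n // 2])
    sigma2 = (1.4826 * mad) ** 2 or 1.0
    for _ in range(n_iter):
        # Gaussian density to the power gamma; normalising constants cancel in the ratios below.
        w = [math.exp(-gamma * (v - mu) ** 2 / (2.0 * sigma2)) for v in x]
        total = sum(w)
        mu = sum(wi * v for wi, v in zip(w, x)) / total
        # Weighting a Gaussian by its own density**gamma shrinks its variance by 1/(1 + gamma),
        # so the weighted variance is rescaled by (1 + gamma).
        sigma2 = (1.0 + gamma) * sum(wi * (v - mu) ** 2 for wi, v in zip(w, x)) / total
    return mu, math.sqrt(sigma2)


# Hypothetical expression-like values with one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 30.0]

print(gamma_weighted_gaussian(data, gamma=0.0))  # gamma = 0 reduces to the ordinary mean and sd
print(gamma_weighted_gaussian(data, gamma=0.5))  # the outlier is effectively ignored
```

With γ = 0 every weight equals 1 and the estimate is the ordinary sample mean, which the outlier pulls upward. A larger γ reduces that bias but discards information from legitimate points in the tails, which is the bias-variance balance the abstract refers to.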
Citations: 3
A Bayesian framework for identifying consistent patterns of microbial abundance between body sites.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-11-08 DOI: 10.1515/sagmb-2019-0027
Richard Meier, Jeffrey A Thompson, Mei Chung, Naisi Zhao, Karl T Kelsey, Dominique S Michaud, Devin C Koestler

Recent studies have found that the microbiomes of both the gut and the mouth are associated with diseases of the gut, including cancer. If resident microbes could be found to exhibit consistent patterns between the mouth and gut, disease status could potentially be assessed non-invasively through profiling of oral samples. Currently, there exists no generally applicable method to test for such associations. Here we present a Bayesian framework to identify microbes that exhibit consistent patterns between body sites with respect to a phenotypic variable. For a given operational taxonomic unit (OTU), a Bayesian regression model is used to obtain Markov chain Monte Carlo estimates of abundance among strata, calculate a correlation statistic, and conduct a formal test based on its posterior distribution. Extensive simulation studies demonstrate the overall viability of the approach and provide information on which factors affect its performance. Applying our method to a dataset containing oral and gut microbiome samples from 77 pancreatic cancer patients revealed several OTUs exhibiting consistent patterns between gut and mouth with respect to disease subtype. Our method is well powered for modest sample sizes and moderate strength of association and can be flexibly extended to other research settings using any currently established Bayesian analysis programs.
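
The last two steps described above (a correlation statistic computed from posterior draws, then a test based on its posterior distribution) can be sketched as follows. The sketch assumes that MCMC draws of stratum-level abundance are already available for each body site, computes a Pearson correlation across strata for every draw, and summarises the result as the posterior probability that the correlation is positive. The choice of Pearson correlation, the decision summary, the function names, and the simulated draws are all assumptions; the abstract does not specify them.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def posterior_prob_positive_corr(oral_draws, gut_draws):
    """Share of posterior draws in which the across-strata correlation is positive.

    Each argument is a list of draws; each draw is a list of stratum-level abundances.
    """
    corrs = [pearson(o, g) for o, g in zip(oral_draws, gut_draws)]
    return sum(c > 0 for c in corrs) / len(corrs), corrs


# Hypothetical posterior draws for one OTU across three strata (e.g. disease subtypes).
random.seed(0)
true_means = [1.0, 2.0, 3.0]
oral = [[m + random.gauss(0.0, 0.3) for m in true_means] for _ in range(1000)]
gut = [[m + random.gauss(0.0, 0.3) for m in true_means] for _ in range(1000)]

prob, corrs = posterior_prob_positive_corr(oral, gut)
print(prob)
```

Computing the statistic once per draw, rather than once from posterior means, carries the uncertainty in the stratum-level estimates through to the final probability.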

Citations: 4
Bi-level feature selection in high dimensional AFT models with applications to a genomic study
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2019-09-17 DOI: 10.1515/sagmb-2019-0016
Hailin Huang, Jizi Shangguan, Peifeng Ruan, Hua Liang
Abstract: We propose a new bi-level feature selection method for high-dimensional accelerated failure time (AFT) models by reformulating them as single-index models. The method yields sparse solutions at both the group and individual feature levels and comes with an algorithm that is computationally efficient and easy to implement. We analyze a genomic dataset as an illustration and present a simulation study showing the finite-sample performance of the proposed method.
Citations: 2
Clustering methods for single-cell RNA-sequencing expression data: performance evaluation with varying sample sizes and cell compositions
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2019-08-14 DOI: 10.1515/sagmb-2019-0004
A. Suner
Abstract: A number of specialized clustering methods have been developed for the accurate analysis of single-cell RNA-sequencing (scRNA-seq) expression data, and several reports have documented the performance of these methods under different conditions. To date, however, no study has systematically evaluated their performance while taking into account the sample size and cell composition of a given scRNA-seq dataset. Herein, a comprehensive performance evaluation of 11 selected scRNA-seq clustering methods was performed using synthetic datasets with known sample sizes and numbers of subpopulations, as well as varying levels of transcriptome complexity. The results indicate that the overall performance of the clustering methods under study is highly dependent on the sample size and complexity of the scRNA-seq dataset. In most cases, clustering performance improved as the number of cells in a given expression dataset increased. The findings also highlight the importance of sample size for the successful detection of rare cell subpopulations with an appropriate clustering tool.
Citations: 4