
Statistical Applications in Genetics and Molecular Biology: latest articles

Determining the number of components in PLS regression on incomplete data set
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-10-18 | DOI: 10.1515/sagmb-2018-0059
T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer
Abstract: Partial least squares (PLS) regression is a multivariate method in which the model parameters are estimated using either the SIMPLS or the NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide parameter estimates on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches have been proposed in the literature, including the Q2 criterion and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components for a PLS regression on incomplete data sets and on data sets imputed using three methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested various criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q2-leave-one-out component selection methods gave more reliable results than AIC- and BIC-based ones.
Citations: 24
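The abstract above notes that NIPALS can produce estimates on incomplete data. A minimal sketch of how this is commonly done for a single PLS1 component follows: each inner regression simply skips the missing entries. The function name, the use of `None` for missing values, and the toy data are assumptions for illustration and are not taken from the paper; the predictors and response are assumed to be centered already.

```python
import math

def nipals_pls1_component(X, y):
    """First PLS1 component via NIPALS, skipping missing X entries (None).

    X: list of n rows, each a list of p floats or None (assumed centered).
    y: list of n floats (assumed centered, fully observed).
    Returns (w, t): unit-norm weight vector and score vector.
    """
    n, p = len(X), len(X[0])
    # Weights: regress each column of X on y using only the observed rows.
    w = []
    for j in range(p):
        rows = [i for i in range(n) if X[i][j] is not None]
        num = sum(X[i][j] * y[i] for i in rows)
        den = sum(y[i] ** 2 for i in rows)
        w.append(num / den)
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: regress each row of X on w using only the observed columns.
    t = []
    for i in range(n):
        cols = [j for j in range(p) if X[i][j] is not None]
        num = sum(X[i][j] * w[j] for j in cols)
        den = sum(w[j] ** 2 for j in cols)
        t.append(num / den)
    return w, t

# Hypothetical toy data: the same matrix with and without one missing cell.
y = [1.0, -1.0, 0.0]
w_full, t_full = nipals_pls1_component([[1.0, 2.0], [-1.0, 0.0], [0.0, -2.0]], y)
w_miss, t_miss = nipals_pls1_component([[1.0, None], [-1.0, 0.0], [0.0, -2.0]], y)
```

On complete data the weights reduce to the normalized cross-product of the predictors with the response; with a missing cell the estimate shifts, which is the behaviour the study examines when comparing component-selection criteria. The sketch does not guard against rows or columns with no usable entries.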
EBADIMEX: an empirical Bayes approach to detect joint differential expression and methylation and to classify samples
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-08-28 | DOI: 10.1101/401232
Tobias Madsen, Michal P. Switnicki, Malene Juul, J. S. Pedersen
Abstract: DNA methylation and gene expression are interdependent and both implicated in cancer development and progression, with many individual biomarkers discovered. A joint analysis of the two data types can potentially lead to biological insights that are not discoverable with separate analyses. To optimally leverage the joint data for identifying perturbed genes and classifying clinical cancer samples, it is important to accurately model the interactions between the two data types. Here, we present EBADIMEX for jointly identifying differential expression and methylation and classifying samples. The moderated t-test widely used with empirical Bayes priors in current differential expression methods is generalised to a multivariate setting by developing: (1) a moderated Welch t-test for equality of means with unequal variances; (2) a moderated F-test for equality of variances; and (3) a multivariate test for equality of means with equal variances. This leads to parametric models with prior distributions for the parameters, which allow fast evaluation and robust analysis of small data sets. EBADIMEX is demonstrated on simulated data as well as a large breast cancer (BRCA) cohort from TCGA. We show that the use of empirical Bayes priors and moderated tests works particularly well on small data sets.
Citations: 0
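For orientation, the univariate moderated t-test that the abstract above takes as its starting point can be sketched as follows. This is the standard equal-variance form, not EBADIMEX's Welch, F, or multivariate generalisations. The prior degrees of freedom `d0` and prior variance `s0_2` are assumed to be given here; in practice they are estimated from all genes. All function names are illustrative.

```python
import math
from statistics import mean, variance

def moderated_variance(s2, d, s0_2, d0):
    """Shrink a sample variance s2 (d degrees of freedom) towards a prior
    variance s0_2 that carries d0 prior degrees of freedom."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

def moderated_t(x, y, s0_2, d0):
    """Two-sample moderated t-statistic with a pooled, shrunken variance.
    Returns (t, degrees_of_freedom)."""
    n1, n2 = len(x), len(y)
    d = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / d
    s2_tilde = moderated_variance(pooled, d, s0_2, d0)
    t = (mean(x) - mean(y)) / math.sqrt(s2_tilde * (1 / n1 + 1 / n2))
    return t, d0 + d

# Hypothetical measurements for one gene in two small groups.
t_stat, df = moderated_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], s0_2=1.0, d0=4)
```

With `d0 = 0` the statistic reduces to the ordinary pooled t-test; larger `d0` pulls noisy per-gene variances towards the prior, which is why the approach is attractive for small data sets.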
Noise-robust assessment of SNP array based CNV calls through local noise estimation of log R ratios.
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-04-28 | DOI: 10.1515/sagmb-2017-0026
Nele Cosemans, Peter Claes, Nathalie Brison, Joris Robert Vermeesch, Hilde Peeters

Arrays based on single nucleotide polymorphisms (SNPs) have been successful for the large-scale discovery of copy number variants (CNVs). However, current CNV calling algorithms still have limitations in detecting CNVs with high specificity and sensitivity, especially for small (<100 kb) CNVs. Therefore, this study presents a simple statistical analysis to evaluate CNV calls from SNP arrays in order to improve the noise-robustness of existing CNV calling algorithms. The proposed approach estimates the local noise of log R ratios and returns the probability that a certain observation is different from this log R ratio noise level. This probability can be thresholded at different levels to tailor specificity and/or sensitivity in a flexible way. Moreover, a comparison based on qPCR experiments showed that the proposed noise-robust CNV calls outperformed the original ones for multiple threshold values.

Citations: 0
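The abstract above does not give the details of the noise model, so the following is only a rough sketch of the general idea under explicit assumptions: the local noise is estimated as the standard deviation of the log R ratios in the probes flanking a called segment, the noise is treated as independent and Gaussian, and the segment mean is compared with the flanking mean by a z-test. The function name, the `flank` parameter, and the synthetic data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def cnv_call_pvalue(lrr, start, end, flank=50):
    """Two-sided p-value that the segment lrr[start:end] has the same mean
    as its flanking probes, given the locally estimated noise level.
    Assumes independent Gaussian noise (an assumption of this sketch)."""
    background = lrr[max(0, start - flank):start] + lrr[end:end + flank]
    segment = lrr[start:end]
    noise_sd = stdev(background)                    # local noise estimate
    standard_error = noise_sd / math.sqrt(len(segment))
    z = (mean(segment) - mean(background)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Synthetic log R ratios: alternating noise with a 5-probe deletion-like dip.
with_dip = [0.1, -0.1] * 10 + [-0.5] * 5 + [0.1, -0.1] * 10
no_dip = [0.1, -0.1] * 25
p_dip = cnv_call_pvalue(with_dip, 20, 25, flank=20)
p_flat = cnv_call_pvalue(no_dip, 20, 24, flank=20)
```

A call would then be kept when the p-value falls below a chosen threshold; moving that threshold trades sensitivity against specificity, as the abstract describes.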
On "A mutual information estimator with exponentially decaying bias" by Zhang and Zheng.
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-03-30 | DOI: 10.1515/sagmb-2018-0005
Jialin Zhang, Chen Chen

Zhang, Z. and Zheng, L. (2015): "A mutual information estimator with exponentially decaying bias," Stat. Appl. Genet. Mol. Biol., 14, 243-252, proposed a nonparametric estimator of mutual information developed from an entropic perspective, and demonstrated that it has much smaller bias than the plug-in estimator yet the same asymptotic normality under certain conditions. However, their article incorrectly suggests that the asymptotic normality could be used for testing independence between two random elements on a joint alphabet. When two random elements are independent, the asymptotic distribution of the $\sqrt{n}$-normed estimator degenerates and therefore the claimed normality does not hold. This article complements Zhang and Zheng by establishing a new chi-square test, using the same entropic statistics, for mutual information being zero. The three examples in Zhang and Zheng are re-worked using the new test. The results turn out to be much more sensible and further illustrate the advantage of the entropic perspective in statistical inference on alphabets. More specifically, in Example 2, when a positive mutual information is known to exist, the new test detects it but the log likelihood ratio test fails to do so.

Citations: 3
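As background for the abstract above, here is a small sketch of the plug-in mutual information estimator and of the classical log likelihood ratio (G) statistic, which equals 2n times the plug-in estimate in nats and is asymptotically chi-square with (K1 - 1)(K2 - 1) degrees of freedom under independence. This is the baseline the article compares against, not the new entropic test it proposes; function names and data are illustrative.

```python
import math
from collections import Counter

def plugin_mutual_information(pairs):
    """Plug-in estimate of mutual information (in nats) from (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p_xy * log(p_xy / (p_x * p_y)), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def g_statistic(pairs):
    """Log likelihood ratio statistic and its degrees of freedom."""
    n = len(pairs)
    k1 = len({x for x, _ in pairs})
    k2 = len({y for _, y in pairs})
    return 2.0 * n * plugin_mutual_information(pairs), (k1 - 1) * (k2 - 1)

# Hypothetical samples: a perfectly balanced table and a perfectly dependent one.
balanced = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
dependent = [(0, 0), (1, 1)] * 10
mi_balanced = plugin_mutual_information(balanced)
mi_dependent = plugin_mutual_information(dependent)
g_value, g_df = g_statistic(dependent)
```

Because n times the plug-in estimate already has a non-degenerate limit under independence, the same quantity scaled by only the square root of n collapses to zero, which is consistent with the degeneracy the article points out.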
Ensemble survival tree models to reveal pairwise interactions of variables with time-to-events outcomes in low-dimensional setting.
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-02-17 | DOI: 10.1515/sagmb-2017-0038
Jean-Eudes Dazard, Hemant Ishwaran, Rajeev Mehlotra, Aaron Weinberg, Peter Zimmerman

Unraveling interactions among variables such as genetic, clinical, demographic and environmental factors is essential to understand the development of common and complex diseases. To increase the power to detect such variable interactions associated with clinical time-to-event outcomes, we borrowed established concepts from random survival forest (RSF) models. We introduce a novel RSF-based pairwise interaction estimator and derive a randomization method with bootstrap confidence intervals for inferring interaction significance. Using various linear and nonlinear time-to-event survival models in simulation studies, we first show the efficiency of our approach: true pairwise interaction effects between variables are uncovered, even though they may not be accompanied by their corresponding main effects and may not be detected by the standard semi-parametric regression modeling and test statistics used in survival analysis. Moreover, using an RSF-based cross-validation scheme for generating prediction estimators, we show that informative predictors may be inferred. We applied our approach to an HIV cohort study recording key host gene polymorphisms and their association with HIV change of tropism or AIDS progression. Altogether, this shows how linear or nonlinear pairwise statistical interactions of variables may be efficiently detected with a predictive value in observational studies with time-to-event outcomes.

Citations: 2
Additive varying-coefficient model for nonlinear gene-environment interactions.
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-02-08 | DOI: 10.1515/sagmb-2017-0008
Cen Wu, Ping-Shou Zhong, Yuehua Cui

Gene-environment (G×E) interaction plays a pivotal role in understanding the genetic basis of complex disease. When environmental factors are measured continuously, one can assess the genetic sensitivity over different environmental conditions on a disease trait. Motivated by the increasing awareness of gene set based association analysis over single variant based approaches, we proposed an additive varying-coefficient model to jointly model variants in a genetic system. The model allows us to examine how variants in a gene set are moderated by an environmental factor to affect a disease phenotype. We approached the problem from a variable selection perspective. In particular, we select variants with varying, constant and zero coefficients, which correspond to cases of G×E interaction, no G×E interaction and no genetic effect, respectively. The procedure was implemented through a two-stage iterative estimation algorithm via the smoothly clipped absolute deviation penalty function. Under certain regularity conditions, we established the consistency property in variable selection as well as effect separation of the two-stage iterative estimators, and showed the optimal convergence rates of the estimates for varying effects. In addition, we showed that the estimates of non-zero constant coefficients enjoy the oracle property. The utility of our procedure was demonstrated through simulation studies and real data analysis.

Citations: 21
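The smoothly clipped absolute deviation (SCAD) penalty named in the abstract above has a standard closed form, sketched below. The default `a = 3.7` is the conventional choice in the SCAD literature and is an assumption here, since the abstract does not state the tuning used; the two-stage estimation algorithm itself is not reproduced.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient theta with tuning parameter lam.

    Linear (lasso-like) near zero, quadratic in between, constant for large
    |theta| so that large coefficients are not shrunk further."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

# Penalty values at a few illustrative coefficient sizes (lam = 1).
values = [scad_penalty(v, 1.0) for v in (0.5, 1.0, 3.7, 10.0, 100.0)]
```

The flat tail is what distinguishes SCAD from the lasso penalty and underlies oracle-type results: sufficiently large effects are estimated as if the true model were known.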
Distance-correlation based gene set analysis in longitudinal studies.
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-02-05 | DOI: 10.1515/sagmb-2017-0053
Jiehuan Sun, Jose D Herazo-Maya, Xiu Huang, Naftali Kaminski, Hongyu Zhao

Longitudinal gene expression profiles of subjects are collected in some clinical studies to monitor disease progression and understand disease etiology. The identification of gene sets that have coordinated changes with relevant clinical outcomes over time from these data could provide significant insights into the molecular basis of disease progression and lead to better treatments. In this article, we propose a Distance-Correlation based Gene Set Analysis (dcGSA) method for longitudinal gene expression data. dcGSA is a non-parametric approach, statistically robust, and can capture both linear and nonlinear relationships between gene sets and clinical outcomes. In addition, dcGSA is able to identify related gene sets in cases where the effects of gene sets on clinical outcomes differ across subjects due to subject heterogeneity, remove the confounding effects of some unobserved time-invariant covariates, and allow the assessment of associations between gene sets and multiple related outcomes simultaneously. Through extensive simulation studies, we demonstrate that dcGSA is more powerful at detecting relevant genes than other commonly used gene set analysis methods. When dcGSA is applied to a real dataset on systemic lupus erythematosus, we are able to identify more disease-related gene sets than other methods.

Citations: 0
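To illustrate why a distance-correlation statistic can pick up nonlinear relationships, as the abstract above claims for dcGSA, here is a sketch of the standard sample distance correlation between two sets of multivariate observations. It does not reproduce dcGSA's longitudinal handling or its significance testing; the function names and the toy data are illustrative. It requires Python 3.8 or later for `math.dist`.

```python
import math

def _centered_distances(points):
    """Double-centered Euclidean distance matrix of a list of tuples."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]      # row means (= column means, by symmetry)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def distance_correlation(xs, ys):
    """Sample distance correlation between paired observations xs and ys."""
    a = _centered_distances(xs)
    b = _centered_distances(ys)
    n = len(xs)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for r in a for v in r) / n ** 2
    dvar_y = sum(v * v for r in b for v in r) / n ** 2
    if dvar_x * dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# Toy data: y = x^2 has zero Pearson correlation with x on a symmetric grid,
# yet its distance correlation is clearly positive.
xs = [(float(v),) for v in range(-2, 3)]
quadratic = [(x[0] ** 2,) for x in xs]
linear = [(2 * x[0] + 1,) for x in xs]
dcor_quadratic = distance_correlation(xs, quadratic)
dcor_linear = distance_correlation(xs, linear)
```

Because each observation is a tuple, the same function accepts a whole gene set (one coordinate per gene) on one side and one or several outcomes on the other.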
Tests for comparison of multiple endpoints with application to omics data.
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2018-01-30 | DOI: 10.1515/sagmb-2017-0033
Marco Marozzi

In biomedical research, multiple endpoints are commonly analyzed in "omics" fields like genomics, proteomics and metabolomics. Traditional methods designed for low-dimensional data either perform poorly or are not applicable when analyzing high-dimensional data whose dimension is generally similar to, or even much larger than, the number of subjects. The complex biochemical interplay between hundreds (or thousands) of endpoints is reflected by complex dependence relations. The aim of the paper is to propose tests that are well suited to analyzing omics data because they do not require the normality assumption and remain powerful for small sample sizes, in the presence of complex dependence relations among endpoints, and when the number of endpoints is much larger than the number of subjects. Unbiasedness and consistency of the tests are proved and their size and power are assessed numerically. It is shown that the proposed approach, based on the nonparametric combination of dependent interpoint distance tests, is very effective. Applications to genomics and metabolomics are discussed.

Citations: 2
Non-parametric estimation of population size changes from the site frequency spectrum
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2017-04-07 | DOI: 10.1101/125351
B. L. Waltoft, A. Hobolth
Abstract: The change in population size over time is a useful quantity for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n − 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n − i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
Citations: 7
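The SFS definition in the abstract above (entry i counts the sites where the mutant base appears i times in a sample of size n) can be illustrated with a minimal sketch. This is not the CubSFS method itself, only the summary statistic it takes as input. The function names, the 0/1 coding of ancestral and mutant bases, and the toy data are assumptions made for illustration; the folding step uses the common minor-allele-count convention, which the abstract does not spell out.

```python
from typing import List, Sequence


def site_frequency_spectrum(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Unfolded SFS of a sample.

    Assumed input layout (hypothetical): one row per sampled sequence,
    one column per site, 0 = ancestral base, 1 = mutant base.
    Returns a list of length n - 1 whose entry i - 1 is the number of
    sites where the mutant base appears exactly i times.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        mutant_count = sum(row[site] for row in haplotypes)
        # Sites with 0 or n mutant copies are not segregating and are skipped.
        if 0 < mutant_count < n:
            sfs[mutant_count - 1] += 1
    return sfs


def fold_sfs(sfs: Sequence[int]) -> List[int]:
    """Folded SFS: merge counts i and n - i (minor-allele convention)."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        folded.append(sfs[i - 1] if i == j else sfs[i - 1] + sfs[j - 1])
    return folded


# Toy sample: n = 4 sequences, 5 sites (made-up data).
example = [
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
]
unfolded = site_frequency_spectrum(example)
folded = fold_sfs(unfolded)
print(unfolded, folded)
```

In the toy sample the last site carries the mutant base in all four sequences, so it is excluded from the spectrum; the folded version is what one would use when the ancestral state is unknown.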
Polyunphased: an extension to polytomous outcomes of the Unphased package for family-based genetic association analysis
IF 0.9 | Zone 4, Mathematics | Q3 Mathematics | Pub Date: 2017-03-01 | DOI: 10.1515/sagmb-2016-0035
A. Bureau, J. Croteau
Abstract: Polytomous phenotypes arise when a disease has multiple subtypes or when two dichotomous phenotypes are analyzed simultaneously. Few software programs offer the option to analyze such phenotypes in family studies, and none implements conditional polytomous logistic regression for within-family analysis robust to population stratification. We introduce Polyunphased, an extension to polytomous phenotypes of the Unphased package, a flexible software tool for genetic association analysis in nuclear families. Like Unphased, Polyunphased is written in C++ and runs from the command line or from a Java graphical user interface. Most Unphased options remain available in Polyunphased, including those handling missing parental genotypes while preserving robustness to population stratification, and the modelling options. Simulation studies confirmed the expected statistical behaviour of the maximum likelihood estimates of the association parameters of the conditional logistic regression model when the corresponding association parameters in the parental term of the likelihood function are set to 0, but revealed convergence problems when estimating these parental association parameters separately. The former approach is thus recommended with polytomous phenotypes.
Citations: 0