
Statistical Applications in Genetics and Molecular Biology: Latest Articles

A practical approach to adjusting for population stratification in genome-wide association studies: principal components and propensity scores (PCAPS).
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-12-04, DOI: 10.1515/sagmb-2017-0054
Huaqing Zhao, Nandita Mitra, Peter A Kanetsky, Katherine L Nathanson, Timothy R Rebbeck

Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.

Citations: 0
A novel method to accurately calculate statistical significance of local similarity analysis for high-throughput time series.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-11-17, DOI: 10.1515/sagmb-2018-0019
Fang Zhang, Ang Shan, Yihui Luan

In recent years, a large amount of time series microbial community data has been produced in molecular biological studies, especially in metagenomics. Among the statistical methods for time series, local similarity analysis is used in a wide range of environments to capture potential local and time-shifted associations that cannot be distinguished by traditional correlation analysis. Initially, the permutation test was widely applied to obtain the statistical significance of local similarity analysis. More recently, a theoretical method has also been developed to achieve this aim. However, all these methods require the assumption that the time series are independent and identically distributed. In this paper, we propose a new approach based on the moving block bootstrap to approximate the statistical significance of local similarity scores for dependent time series. Simulations show that our method can control the type I error rate reasonably, while the theoretical approximation and the permutation test perform less well. Finally, our method is applied to human and marine microbial community datasets, indicating that it can identify potential relationships among operational taxonomic units (OTUs) and significantly decrease the rate of false positives.
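
The moving block bootstrap named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the default block length, and the use of absolute Pearson correlation as a stand-in for the local similarity score are all assumptions made for illustration. The idea shown is only that resampling whole blocks keeps short-range dependence within each series, while drawing the two series independently breaks their pairing.

```python
import random

def moving_block_bootstrap(series, block_len, rng):
    """Resample a series by concatenating randomly chosen overlapping blocks."""
    n = len(series)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(series)")
    n_starts = n - block_len + 1           # number of overlapping blocks
    out = []
    while len(out) < n:
        start = rng.randrange(n_starts)    # pick a block start uniformly at random
        out.extend(series[start:start + block_len])
    return out[:n]                         # trim to the original length

def abs_corr(x, y):
    """Absolute Pearson correlation (hypothetical stand-in for the local similarity score)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def mbb_p_value(x, y, block_len=5, n_boot=500, seed=0, score=abs_corr):
    """Approximate p-value: share of block-resampled pairs scoring at least the observed value."""
    rng = random.Random(seed)
    observed = score(x, y)
    hits = 0
    for _ in range(n_boot):
        xb = moving_block_bootstrap(x, block_len, rng)
        yb = moving_block_bootstrap(y, block_len, rng)  # independent draw breaks the pairing
        if score(xb, yb) >= observed:
            hits += 1
    return (hits + 1) / (n_boot + 1)       # add-one correction avoids a p-value of zero
```

Any other score function taking two equal-length sequences can be passed through the `score` argument; the block length would normally be tuned to the strength of the autocorrelation.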

Citations: 2
Determining the number of components in PLS regression on incomplete data set
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-10-18, DOI: 10.1515/sagmb-2018-0059
T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer
Partial least squares regression (PLS regression) is a multivariate method in which the model parameters are estimated using either the SIMPLS or NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide parameter estimates on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches have been proposed in the literature, including the Q2 criterion and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components for a PLS regression on an incomplete data set and on an imputed data set using three imputation methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested various criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q2-leave-one-out component selection methods gave more reliable results than AIC- and BIC-based ones.
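
As a rough illustration of how a Q2-based rule picks the number of components, the sketch below uses the common definition Q2_h = 1 - PRESS_h / RSS_(h-1) and keeps adding components while Q2_h stays at or above a cutoff. The function name, the input layout, and the default cutoff of 0.0975 (a frequently cited convention) are assumptions; the abstract does not state the exact rule the authors applied.

```python
def select_components_q2(press, rss_prev, threshold=0.0975):
    """Count leading components whose Q2 meets the threshold.

    press[h-1]    : cross-validated PRESS of the model with h components
    rss_prev[h-1] : residual sum of squares of the model with h-1 components
                    (rss_prev[0] belongs to the model with no component)
    Both lists are hypothetical inputs computed elsewhere.
    """
    n_comp = 0
    for p, r in zip(press, rss_prev):
        q2 = 1 - p / r          # predictive gain of adding this component
        if q2 < threshold:
            break               # stop at the first uninformative component
        n_comp += 1
    return n_comp

# Made-up values: Q2 is 0.60, 0.20, then about 0.03, so two components are kept.
n_selected = select_components_q2([40.0, 28.0, 29.0], [100.0, 35.0, 30.0])
```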
Citations: 24
EBADIMEX: an empirical Bayes approach to detect joint differential expression and methylation and to classify samples
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-08-28, DOI: 10.1101/401232
Tobias Madsen, Michal P. Switnicki, Malene Juul, J. S. Pedersen
DNA methylation and gene expression are interdependent and both implicated in cancer development and progression, with many individual biomarkers discovered. A joint analysis of the two data types can potentially lead to biological insights that are not discoverable with separate analyses. To optimally leverage the joint data for identifying perturbed genes and classifying clinical cancer samples, it is important to accurately model the interactions between the two data types. Here, we present EBADIMEX for jointly identifying differential expression and methylation and classifying samples. The moderated t-test widely used with empirical Bayes priors in current differential expression methods is generalised to a multivariate setting by developing: (1) a moderated Welch t-test for equality of means with unequal variances; (2) a moderated F-test for equality of variances; and (3) a multivariate test for equality of means with equal variances. This leads to parametric models with prior distributions for the parameters, which allow fast evaluation and robust analysis of small data sets. EBADIMEX is demonstrated on simulated data as well as a large breast cancer (BRCA) cohort from TCGA. We show that the use of empirical Bayes priors and moderated tests works particularly well on small data sets.
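
For readers unfamiliar with the building blocks named here, the following Python sketch shows the classical Welch t-statistic with Welch-Satterthwaite degrees of freedom, plus the usual form of empirical Bayes variance shrinkage used in moderated t-tests. It is not the EBADIMEX implementation: the function names and the prior values `s0_2` and `d0` are hypothetical, and in practice the prior would be estimated from all genes.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch t-statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)        # unbiased sample variances
    se2 = vx / nx + vy / ny                  # squared standard error of the mean difference
    t = (mean(x) - mean(y)) / se2 ** 0.5
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

def moderated_variance(s2, d, s0_2, d0):
    """Shrink a sample variance s2 (d degrees of freedom) toward a prior s0_2 (d0 degrees of freedom)."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

t_stat, df = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
shrunk = moderated_variance(s2=2.0, d=4, s0_2=1.0, d0=4)   # hypothetical prior values
```

With few samples per group the shrunken variance is pulled strongly toward the prior, which is the usual reason moderated tests are more stable on small data sets.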
Citations: 0
Noise-robust assessment of SNP array based CNV calls through local noise estimation of log R ratios.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-04-28, DOI: 10.1515/sagmb-2017-0026
Nele Cosemans, Peter Claes, Nathalie Brison, Joris Robert Vermeesch, Hilde Peeters

Arrays based on single nucleotide polymorphisms (SNPs) have been successful for the large-scale discovery of copy number variants (CNVs). However, current CNV calling algorithms still have limitations in detecting CNVs with high specificity and sensitivity, especially for small (<100 kb) CNVs. Therefore, this study presents a simple statistical analysis to evaluate CNV calls from SNP arrays in order to improve the noise-robustness of existing CNV calling algorithms. The proposed approach estimates local noise of log R ratios and returns the probability that a certain observation is different from this log R ratio noise level. This probability can be thresholded at different levels to tailor specificity and/or sensitivity in a flexible way. Moreover, a comparison based on qPCR experiments showed that the proposed noise-robust CNV calls outperformed original ones for multiple threshold values.

Citations: 0
On "A mutual information estimator with exponentially decaying bias" by Zhang and Zheng.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-03-30, DOI: 10.1515/sagmb-2018-0005
Jialin Zhang, Chen Chen

Zhang, Z. and Zheng, L. (2015): "A mutual information estimator with exponentially decaying bias," Stat. Appl. Genet. Mol. Biol., 14, 243-252, proposed a nonparametric estimator of mutual information developed from an entropic perspective, and demonstrated that it has much smaller bias than the plugin estimator yet has the same asymptotic normality under certain conditions. However, their article incorrectly suggests that the asymptotic normality could be used for testing independence between two random elements on a joint alphabet. When two random elements are independent, the asymptotic distribution of the $\sqrt{n}$-normed estimator degenerates and therefore the claimed normality does not hold. This article complements Zhang and Zheng by establishing a new chi-square test, using the same entropic statistics, of the hypothesis that the mutual information is zero. The three examples in Zhang and Zheng are re-worked using the new test. The results turn out to be much more sensible and further illustrate the advantage of the entropic perspective in statistical inference on alphabets. More specifically, in Example 2, where a positive mutual information is known to exist, the new test detects it but the log likelihood ratio test fails to do so.
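
To make the baseline concrete, the sketch below computes the plugin estimator of mutual information and the classical log likelihood ratio statistic G = 2n x (plugin mutual information, in nats), which under independence is asymptotically chi-square with (K1 - 1)(K2 - 1) degrees of freedom for alphabets of sizes K1 and K2. This is only the comparison method mentioned in the abstract, not the new entropic chi-square test; the function names are assumptions.

```python
from collections import Counter
from math import log

def plugin_mutual_information(pairs):
    """Plugin (empirical-frequency) estimate of mutual information, in nats."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # Sum over observed cells of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def g_statistic(pairs):
    """Log likelihood ratio statistic for independence: G = 2 * n * plugin MI."""
    return 2 * len(pairs) * plugin_mutual_information(pairs)

independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5   # balanced table, plugin MI is 0
dependent = [(0, 0), (1, 1)] * 10                    # perfect dependence, plugin MI is log 2
```

Because G scales the estimate by n rather than by the square root of n, it also illustrates why a normal limit for the square-root-normed estimator cannot hold when the true mutual information is zero.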

Citations: 3
Ensemble survival tree models to reveal pairwise interactions of variables with time-to-events outcomes in low-dimensional setting.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-02-17, DOI: 10.1515/sagmb-2017-0038
Jean-Eudes Dazard, Hemant Ishwaran, Rajeev Mehlotra, Aaron Weinberg, Peter Zimmerman

Unraveling interactions among variables such as genetic, clinical, demographic and environmental factors is essential to understand the development of common and complex diseases. To increase the power to detect such variable interactions associated with clinical time-to-events outcomes, we borrowed established concepts from random survival forest (RSF) models. We introduce a novel RSF-based pairwise interaction estimator and derive a randomization method with bootstrap confidence intervals for inferring interaction significance. Using various linear and nonlinear time-to-events survival models in simulation studies, we first show the efficiency of our approach: true pairwise interaction effects between variables are uncovered, even though they may not be accompanied by their corresponding main effects and may not be detected by standard semi-parametric regression modeling and test statistics used in survival analysis. Moreover, using an RSF-based cross-validation scheme for generating prediction estimators, we show that informative predictors may be inferred. We applied our approach to an HIV cohort study recording key host gene polymorphisms and their association with HIV change of tropism or AIDS progression. Altogether, this shows how linear or nonlinear pairwise statistical interactions of variables may be efficiently detected with a predictive value in observational studies with time-to-event outcomes.

Citations: 2
Additive varying-coefficient model for nonlinear gene-environment interactions.
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2018-02-08, DOI: 10.1515/sagmb-2017-0008
Cen Wu, Ping-Shou Zhong, Yuehua Cui

Gene-environment (G×E) interaction plays a pivotal role in understanding the genetic basis of complex disease. When environmental factors are measured continuously, one can assess the genetic sensitivity over different environmental conditions on a disease trait. Motivated by the increasing awareness of gene set based association analysis over single variant based approaches, we proposed an additive varying-coefficient model to jointly model variants in a genetic system. The model allows us to examine how variants in a gene set are moderated by an environmental factor to affect a disease phenotype. We approached the problem from a variable selection perspective. In particular, we select variants with varying, constant and zero coefficients, which correspond to cases of G×E interaction, no G×E interaction and no genetic effect, respectively. The procedure was implemented through a two-stage iterative estimation algorithm via the smoothly clipped absolute deviation penalty function. Under certain regularity conditions, we established the consistency property in variable selection as well as effect separation of the two-stage iterative estimators, and showed the optimal convergence rates of the estimates for varying effects. In addition, we showed that the estimates of non-zero constant coefficients enjoy the oracle property. The utility of our procedure was demonstrated through simulation studies and real data analysis.
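
The smoothly clipped absolute deviation (SCAD) penalty mentioned in this abstract has a standard closed form, sketched below in Python. The function name is an assumption, and `a = 3.7` is the default commonly used in the literature rather than a value taken from this paper; the two-stage estimation algorithm itself is not reproduced here.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient (requires lam > 0 and a > 2)."""
    t = abs(theta)
    if t <= lam:
        return lam * t                                           # lasso-like near zero
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))  # quadratic transition
    return lam * lam * (a + 1) / 2                               # constant for large coefficients

small = scad_penalty(0.5, lam=1.0)    # linear region
large = scad_penalty(10.0, lam=1.0)   # flat region
```

The flat region is what distinguishes SCAD from the lasso: large coefficients are not penalized further, which is the usual intuition behind the oracle property.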

Citations: 21
Distance-correlation based gene set analysis in longitudinal studies.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2018-02-05 DOI: 10.1515/sagmb-2017-0053
Jiehuan Sun, Jose D Herazo-Maya, Xiu Huang, Naftali Kaminski, Hongyu Zhao

Longitudinal gene expression profiles of subjects are collected in some clinical studies to monitor disease progression and understand disease etiology. Identifying gene sets whose changes over time are coordinated with relevant clinical outcomes could provide significant insights into the molecular basis of disease progression and lead to better treatments. In this article, we propose a Distance-Correlation based Gene Set Analysis (dcGSA) method for longitudinal gene expression data. dcGSA is a non-parametric, statistically robust approach that can capture both linear and nonlinear relationships between gene sets and clinical outcomes. In addition, dcGSA is able to identify related gene sets in cases where the effects of gene sets on clinical outcomes differ across subjects due to subject heterogeneity, remove the confounding effects of some unobserved time-invariant covariates, and allow the assessment of associations between gene sets and multiple related outcomes simultaneously. Through extensive simulation studies, we demonstrate that dcGSA is more powerful at detecting relevant genes than other commonly used gene set analysis methods. When dcGSA is applied to a real dataset on systemic lupus erythematosus, we are able to identify more disease-related gene sets than other methods.
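To make the claim about capturing both linear and nonlinear relationships concrete, the sketch below computes the ordinary sample distance correlation between two sets of observations. It is not the dcGSA procedure itself, which additionally handles the longitudinal structure and subject heterogeneity. The names `distance_correlation` and `_centered_distances` are hypothetical helpers introduced for this illustration.

```python
import numpy as np

def _centered_distances(x):
    """Pairwise Euclidean distance matrix, double-centered."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n observations of one variable
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation between x and y (rows are observations)."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()   # squared sample distance covariance
    dvar_x = (a * a).mean()  # squared sample distance variance of x
    dvar_y = (b * b).mean()
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

For example, with `x` symmetric around zero and `y = x**2`, the Pearson correlation is zero while the distance correlation stays clearly positive, which is the kind of nonlinear dependence the abstract refers to.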

Citations: 0
Tests for comparison of multiple endpoints with application to omics data.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2018-01-30 DOI: 10.1515/sagmb-2017-0033
Marco Marozzi

In biomedical research, multiple endpoints are commonly analyzed in "omics" fields like genomics, proteomics and metabolomics. Traditional methods designed for low-dimensional data either perform poorly or are not applicable when analyzing high-dimensional data whose dimension is generally similar to, or even much larger than, the number of subjects. The complex biochemical interplay between hundreds (or thousands) of endpoints is reflected by complex dependence relations. The aim of the paper is to propose tests that are well suited to analyzing omics data because they do not require the normality assumption and remain powerful for small sample sizes, in the presence of complex dependence relations among endpoints, and when the number of endpoints is much larger than the number of subjects. Unbiasedness and consistency of the tests are proved and their size and power are assessed numerically. It is shown that the proposed approach based on the nonparametric combination of dependent interpoint distance tests is very effective. Applications to genomics and metabolomics are discussed.
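The abstract does not specify the test statistics, so the following is only a generic sketch of the underlying idea: a two-sample permutation test built on interpoint distances, which needs no normality assumption and still works when endpoints outnumber subjects. The energy-type statistic, the function names, and the defaults `n_perm=199` and `seed=0` are assumptions made for illustration. The paper's nonparametric combination of several dependent tests is not reproduced here.

```python
import numpy as np

def energy_statistic(d, labels):
    """Between-group minus within-group mean distances (energy-type statistic)."""
    g1, g2 = labels, ~labels
    between = d[np.ix_(g1, g2)].mean()
    within1 = d[np.ix_(g1, g1)].mean()
    within2 = d[np.ix_(g2, g2)].mean()
    return 2 * between - within1 - within2

def interpoint_permutation_test(x, y, n_perm=199, seed=0):
    """Permutation p-value for H0: both groups share the same distribution.

    x, y : arrays of shape (n_subjects, n_endpoints)
    """
    rng = np.random.default_rng(seed)
    z = np.vstack([x, y])
    # Distances are computed once; permuting labels only regroups them.
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    labels = np.zeros(len(z), dtype=bool)
    labels[: len(x)] = True
    observed = energy_statistic(d, labels)
    exceed = sum(
        energy_statistic(d, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)
```

Only the subject-by-subject distance matrix enters the test, so its cost depends on the number of subjects rather than the number of endpoints once the distances are computed.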

Citations: 2