Statistical Applications in Genetics and Molecular Biology: Latest Articles, Page 8

An Empirical Bayes approach for the identification of long-range chromosomal interaction from Hi-C data

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-12-17 DOI: 10.1101/497776

Qi Zhang, Zheng Xu, Yutong Lai

Hi-C experiments have become very popular for studying the 3D genome structure in recent years. Identification of long-range chromosomal interactions, i.e., peak detection, is crucial for Hi-C data analysis. However, it remains a challenging task due to the inherent high dimensionality, sparsity, and over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the “true” interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distributions of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of our proposed approach and applied it to real datasets. Our results suggest that EBHiC can identify better peaks in terms of accuracy, biological interpretability, and consistency across biological replicates. The source code is available on GitHub (https://github.com/QiZhangStat/EBHiC).


Citations: 0

False discovery control for penalized variable selections with high-dimensional covariates.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-12-15 DOI: 10.1515/sagmb-2018-0038

Kevin He, Xiang Zhou, Hui Jiang, Xiaoquan Wen, Yi Li

Modern biotechnologies have produced a vast amount of high-throughput data in which the number of predictors far exceeds the sample size. Penalized variable selection has emerged as a powerful and efficient dimension reduction tool. However, control of false discoveries (i.e., inclusion of irrelevant variables) for penalized high-dimensional variable selection presents serious challenges. To effectively control the fraction of false discoveries for penalized variable selections, we propose a false discovery controlling procedure. The proposed method is general and flexible, and can work with a broad class of variable selection algorithms, not only for linear regressions, but also for generalized linear models and survival analysis.


Citations: 0

A practical approach to adjusting for population stratification in genome-wide association studies: principal components and propensity scores (PCAPS).

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-12-04 DOI: 10.1515/sagmb-2017-0054

Huaqing Zhao, Nandita Mitra, Peter A Kanetsky, Katherine L Nathanson, Timothy R Rebbeck

Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.


Citations: 0

A novel method to accurately calculate statistical significance of local similarity analysis for high-throughput time series.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-11-17 DOI: 10.1515/sagmb-2018-0019

Fang Zhang, Ang Shan, Yihui Luan

In recent years, a large amount of time series microbial community data has been produced in molecular biological studies, especially in metagenomics. Among the statistical methods for time series, local similarity analysis is used in a wide range of environments to capture potential local and time-shifted associations that cannot be distinguished by traditional correlation analysis. Initially, the permutation test was widely used to obtain the statistical significance of local similarity analysis. More recently, a theoretical method has also been developed to achieve this aim. However, all these methods require the assumption that the time series are independent and identically distributed. In this paper, we propose a new approach based on the moving block bootstrap to approximate the statistical significance of local similarity scores for dependent time series. Simulations show that our method can control the type I error rate reasonably, while the theoretical approximation and the permutation test perform less well. Finally, our method is applied to human and marine microbial community datasets, indicating that it can identify potential relationships among operational taxonomic units (OTUs) and significantly decrease the rate of false positives.
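To make the resampling idea concrete, here is a minimal sketch of a generic moving block bootstrap draw in Python. It is not the authors' implementation: the function name `moving_block_bootstrap`, the block length, and the toy series are assumptions chosen for illustration, and the abstract does not say how the blocks are combined with the local similarity score.

```python
import random


def moving_block_bootstrap(series, block_length, rng):
    """Return one bootstrap replicate built from overlapping blocks.

    Blocks of `block_length` consecutive observations are drawn with
    replacement from all possible starting positions and concatenated
    until the replicate is as long as the original series. Sampling
    whole blocks keeps the short-range dependence inside each block.
    """
    n = len(series)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(series)")
    resampled = []
    while len(resampled) < n:
        # Any start that leaves room for a full block is allowed.
        start = rng.randrange(n - block_length + 1)
        resampled.extend(series[start:start + block_length])
    return resampled[:n]  # trim the last block if it overshoots


# Hypothetical toy series; a real analysis would use abundance profiles.
rng = random.Random(0)
x = list(range(10))
resampled = moving_block_bootstrap(x, block_length=3, rng=rng)
```

Repeating such draws many times and recomputing a statistic on each replicate gives a reference distribution that does not rely on the observations being independent. The block length controls how much of the dependence is preserved and would need to be chosen for the data at hand.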


Citations: 2

Determining the number of components in PLS regression on incomplete data set

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-10-18 DOI: 10.1515/sagmb-2018-0059

T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer

Partial least squares regression – or PLS regression – is a multivariate method in which the model parameters are estimated using either the SIMPLS or NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide parameter estimates on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches have been proposed in the literature, including the Q2 criterion, and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria to select the number of components for a PLS regression on incomplete data sets and on imputed data sets using three imputation methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested various criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q2-leave-one-out component selection methods gave more reliable results than AIC and BIC-based ones.


Citations: 24

EBADIMEX: an empirical Bayes approach to detect joint differential expression and methylation and to classify samples

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-08-28 DOI: 10.1101/401232

Tobias Madsen, Michal P. Switnicki, Malene Juul, J. S. Pedersen

DNA methylation and gene expression are interdependent and both implicated in cancer development and progression, with many individual biomarkers discovered. A joint analysis of the two data types can potentially lead to biological insights that are not discoverable with separate analyses. To optimally leverage the joint data for identifying perturbed genes and classifying clinical cancer samples, it is important to accurately model the interactions between the two data types. Here, we present EBADIMEX for jointly identifying differential expression and methylation and classifying samples. The moderated t-test widely used with empirical Bayes priors in current differential expression methods is generalised to a multivariate setting by developing: (1) a moderated Welch t-test for equality of means with unequal variances; (2) a moderated F-test for equality of variances; and (3) a multivariate test for equality of means with equal variances. This leads to parametric models with prior distributions for the parameters, which allow fast evaluation and robust analysis of small data sets. EBADIMEX is demonstrated on simulated data as well as a large breast cancer (BRCA) cohort from TCGA. We show that the use of empirical Bayes priors and moderated tests works particularly well on small data sets.
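As background for the first of the three tests, the sketch below computes the classical (unmoderated) Welch t-statistic and its Welch-Satterthwaite degrees of freedom in Python. It is only the textbook starting point: the empirical Bayes moderation that EBADIMEX adds is not reproduced here, and the function name `welch_t` and the sample values are assumptions for illustration.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Classical Welch t-statistic for two samples with unequal variances.

    Returns the statistic and the Welch-Satterthwaite approximation to
    its degrees of freedom. No empirical Bayes shrinkage is applied.
    """
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means.
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1)
    )
    return t, df


# Hypothetical measurements for one gene in two sample groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
```

With only a handful of samples per group the per-gene variance estimates in this classical form are noisy, which is the usual motivation for borrowing information across genes through a prior.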


Citations: 0

Noise-robust assessment of SNP array based CNV calls through local noise estimation of log R ratios.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-04-28 DOI: 10.1515/sagmb-2017-0026

Nele Cosemans, Peter Claes, Nathalie Brison, Joris Robert Vermeesch, Hilde Peeters

Arrays based on single nucleotide polymorphisms (SNPs) have been successful for the large-scale discovery of copy number variants (CNVs). However, current CNV calling algorithms still have limitations in detecting CNVs with high specificity and sensitivity, especially in the case of small CNVs (under 100 kb). Therefore, this study presents a simple statistical analysis to evaluate CNV calls from SNP arrays in order to improve the noise-robustness of existing CNV calling algorithms. The proposed approach estimates the local noise of log R ratios and returns the probability that a certain observation is different from this log R ratio noise level. This probability can be thresholded at different values to tailor specificity and/or sensitivity in a flexible way. Moreover, a comparison based on qPCR experiments showed that the proposed noise-robust CNV calls outperformed the original ones for multiple threshold values.


Citations: 0

On "A mutual information estimator with exponentially decaying bias" by Zhang and Zheng.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-03-30 DOI: 10.1515/sagmb-2018-0005

Jialin Zhang, Chen Chen

Zhang, Z. and Zheng, L. (2015): "A mutual information estimator with exponentially decaying bias," Stat. Appl. Genet. Mol. Biol., 14, 243-252, proposed a nonparametric estimator of mutual information developed from an entropic perspective, and demonstrated that it has much smaller bias than the plug-in estimator yet the same asymptotic normality under certain conditions. However, their article incorrectly suggests that the asymptotic normality could be used for testing independence between two random elements on a joint alphabet. When two random elements are independent, the asymptotic distribution of the $\sqrt{n}$-normed estimator degenerates and therefore the claimed normality does not hold. This article complements Zhang and Zheng by establishing a new chi-square test, based on the same entropic statistics, of the hypothesis that mutual information is zero. The three examples in Zhang and Zheng are re-worked using the new test. The results turn out to be much more sensible and further illustrate the advantage of the entropic perspective in statistical inference on alphabets. More specifically, in Example 2, when a positive mutual information is known to exist, the new test detects it but the log likelihood ratio test fails to do so.
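For readers unfamiliar with the baseline being discussed, the following Python sketch computes the plug-in estimator of mutual information from paired observations on a finite alphabet. It illustrates only the plug-in estimator mentioned above, not the entropic estimator of Zhang and Zheng or the new chi-square test; the function name `plugin_mutual_information` and the toy samples are assumptions for illustration.

```python
import math
from collections import Counter


def plugin_mutual_information(pairs):
    """Plug-in estimate of mutual information (in nats).

    Empirical joint and marginal frequencies are substituted into
    I(X; Y) = sum_{x,y} p(x,y) * log(p(x,y) / (p(x) * p(y))).
    """
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    marginal_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # (count/n) / ((cx/n) * (cy/n)) simplifies to count * n / (cx * cy).
        ratio = count * n / (marginal_x[x] * marginal_y[y])
        mi += (count / n) * math.log(ratio)
    return mi


# Hypothetical samples: Y equal to X, and a balanced independent design.
dependent = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

mi_dependent = plugin_mutual_information(dependent)
mi_independent = plugin_mutual_information(independent)
```

In the first sample the estimate equals log 2, the entropy of a fair binary variable, and in the perfectly balanced second sample it is exactly zero. The classical log likelihood ratio statistic for independence in a contingency table is 2n times this plug-in estimate, which is the test the abstract compares against.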


Citations: 3

Ensemble survival tree models to reveal pairwise interactions of variables with time-to-events outcomes in low-dimensional setting.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-02-17 DOI: 10.1515/sagmb-2017-0038

Jean-Eudes Dazard, Hemant Ishwaran, Rajeev Mehlotra, Aaron Weinberg, Peter Zimmerman

Unraveling interactions among variables such as genetic, clinical, demographic and environmental factors is essential to understand the development of common and complex diseases. To increase the power to detect such variable interactions associated with clinical time-to-event outcomes, we borrowed established concepts from random survival forest (RSF) models. We introduce a novel RSF-based pairwise interaction estimator and derive a randomization method with bootstrap confidence intervals for inferring interaction significance. Using various linear and nonlinear time-to-event survival models in simulation studies, we first show the efficiency of our approach: true pairwise interaction effects between variables are uncovered, even though they may not be accompanied by their corresponding main effects and may not be detected by the standard semi-parametric regression modeling and test statistics used in survival analysis. Moreover, using an RSF-based cross-validation scheme for generating prediction estimators, we show that informative predictors may be inferred. We applied our approach to an HIV cohort study recording key host gene polymorphisms and their association with HIV change of tropism or AIDS progression. Altogether, this shows how linear or nonlinear pairwise statistical interactions of variables may be efficiently detected with a predictive value in observational studies with time-to-event outcomes.


Citations: 2

Additive varying-coefficient model for nonlinear gene-environment interactions.

IF 0.9 Zone 4 Mathematics Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2018-02-08 DOI: 10.1515/sagmb-2017-0008

Cen Wu, Ping-Shou Zhong, Yuehua Cui

Gene-environment (G×E) interaction plays a pivotal role in understanding the genetic basis of complex disease. When environmental factors are measured continuously, one can assess the genetic sensitivity of a disease trait across different environmental conditions. Motivated by the increasing awareness of gene-set-based association analysis over single-variant-based approaches, we proposed an additive varying-coefficient model to jointly model variants in a genetic system. The model allows us to examine how variants in a gene set are moderated by an environmental factor to affect a disease phenotype. We approached the problem from a variable selection perspective. In particular, we select variants with varying, constant and zero coefficients, which correspond to cases of G×E interaction, no G×E interaction and no genetic effect, respectively. The procedure was implemented through a two-stage iterative estimation algorithm via the smoothly clipped absolute deviation (SCAD) penalty function. Under certain regularity conditions, we established the consistency property in variable selection as well as effect separation of the two-stage iterative estimators, and showed the optimal convergence rates of the estimates for varying effects. In addition, we showed that the estimates of non-zero constant coefficients enjoy the oracle property. The utility of our procedure was demonstrated through simulation studies and real data analysis.
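
For readers unfamiliar with the smoothly clipped absolute deviation penalty, the sketch below evaluates its standard form (Fan and Li, 2001) for a single coefficient. The abstract does not say how the penalty is applied within the two-stage algorithm, so this shows only the penalty itself. The function name `scad_penalty` and the default a = 3.7 (the value commonly used in the literature) are assumptions made here for illustration.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """Standard SCAD penalty for one coefficient (illustrative sketch)."""
    if lam < 0 or a <= 2:
        raise ValueError("requires lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Small coefficients: penalized linearly, like the lasso.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic transition region.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so no additional shrinkage.
    return lam * lam * (a + 1) / 2


for value in (0.5, 2.0, 10.0):
    print(value, scad_penalty(value, lam=1.0))
```

The three branches match the usual motivation for the penalty: small coefficients are shrunk toward zero, which drives variable selection, while large coefficients incur a flat penalty and are left nearly unbiased. This near-unbiasedness for large effects is the property usually associated with oracle-type results.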

Citations: 21