Statistical Applications in Genetics and Molecular Biology: Latest Articles

pwrBRIDGE: a user-friendly web application for power and sample size estimation in batch-confounded microarray studies with dependent samples.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-10-10; eCollection Date: 2022-01-01; DOI: 10.1515/sagmb-2022-0003
Qing Xia, Jeffrey A Thompson, Devin C Koestler

Batch effect Reduction of mIcroarray data with Dependent samples usinG Empirical Bayes (BRIDGE) is a recently developed statistical method for batch effect correction in batch-confounded microarray studies with dependent samples. The key component of the BRIDGE methodology is the use of samples run as technical replicates in two or more batches, "bridging samples", to inform batch effect correction/attenuation. While previously published results indicate a relationship between the number of bridging samples, M, and the statistical power of downstream statistical testing on the batch-corrected data, there is as yet no formal statistical framework or user-friendly software for estimating the M needed to achieve a specific statistical power for hypothesis tests conducted on the batch-corrected data. To fill this gap, we developed pwrBRIDGE, a simulation-based approach to estimate the bridging sample size, M, in batch-confounded longitudinal microarray studies. To illustrate the use of pwrBRIDGE, we consider a hypothetical, longitudinal batch-confounded study whose goal is to identify genes in human blood associated with progression from amnestic mild cognitive impairment (aMCI) to Alzheimer's disease (AD) after a 5-year follow-up. pwrBRIDGE helps researchers design and plan batch-confounded microarray studies with dependent samples to avoid over- or under-powered studies.
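
The general idea of simulation-based power estimation can be sketched as follows. This is a generic Monte Carlo illustration for a paired test, not the pwrBRIDGE procedure itself: the effect size, the noise standard deviation, the normal approximation for the critical value, and the function names `estimate_power` and `smallest_m` are all assumptions made for the example.

```python
import random
from statistics import NormalDist, mean, stdev


def estimate_power(m, effect=0.5, sd=1.0, alpha=0.05, n_sim=2000, seed=0):
    """Monte Carlo power of a two-sided paired test with m paired samples.

    Hypothetical setup: each within-pair difference is N(effect, sd^2).
    The critical value uses a normal approximation, which is slightly
    anti-conservative for small m (an exact t quantile would be better).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        diffs = [rng.gauss(effect, sd) for _ in range(m)]
        t_stat = mean(diffs) / (stdev(diffs) / m ** 0.5)
        if abs(t_stat) > z_crit:
            rejections += 1
    return rejections / n_sim


def smallest_m(target=0.8, candidates=range(5, 101, 5)):
    """Return the first candidate m whose estimated power reaches target."""
    for m in candidates:
        if estimate_power(m) >= target:
            return m
    return None
```

Calling `smallest_m()` scans candidate sample sizes in increasing order and stops at the first one whose simulated power reaches the target, which is the same search a power-analysis tool performs over a far more detailed data-generating model.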

Citations: 0
Distinct characteristics of correlation analysis at the single-cell and the population level.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-08-02; eCollection Date: 2022-01-01; DOI: 10.1515/sagmb-2022-0015
Guoyu Wu, Yuchao Li

Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interest for its ability to obtain high-resolution molecular phenotypes. It turns out that there is little overlap between the co-expressed genes identified in single-cell level investigations and those identified in population level investigations. However, the nature of the relationship between correlations at the single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between the correlation coefficients at the single-cell level and those at the population level, and to bridge the gap between them. By developing formulations that link correlations at the single-cell and the population level, we illustrated that aggregated correlations could be stronger than, weaker than, or equal to the corresponding individual correlations, depending on the variations and the correlations within the population. When the correlation within the population is weaker than the individual correlation, the aggregated correlation is stronger than the corresponding individual correlation. Besides, our data indicated that the aggregated correlation is more likely to be stronger than the corresponding individual correlation, and it was rare to find gene pairs that were strongly correlated exclusively at the single-cell level. Through a bottom-up approach to modeling interactions between molecules in a signaling cascade or in multi-regulator-controlled gene expression, we surprisingly found that the existence of an interaction between two components could not be excluded simply on the basis of their low correlation coefficients, suggesting a reconsideration of connectivity within biological networks that was derived solely from correlation analysis. We also investigated the impact of technical random measurement errors on the correlation coefficients at the single-cell level and the population level. The results indicate that the aggregated correlation is relatively robust and less affected. Because of the heterogeneity among single cells, correlation coefficients calculated from single-cell level data might differ from those calculated at the population level. Depending on the specific question being asked, proper sampling and normalization procedures should be carried out before any conclusions are drawn.
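
A toy simulation can make the stronger-when-aggregated case concrete. The sketch below is not the authors' formulation: it assumes a hypothetical model in which two genes share a population-level driver while their cell-to-cell noise is independent, and all parameter values and function names are illustrative.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_pops=200, n_cells=50, cell_sd=2.0, seed=1):
    """Return (single-cell correlation, aggregated correlation).

    Assumed model: within each population, both genes equal a shared
    population-level value plus independent per-cell noise.
    """
    rng = random.Random(seed)
    cell_x, cell_y, pop_x, pop_y = [], [], [], []
    for _ in range(n_pops):
        shared = rng.gauss(0.0, 1.0)  # population-level driver of both genes
        xs = [shared + rng.gauss(0.0, cell_sd) for _ in range(n_cells)]
        ys = [shared + rng.gauss(0.0, cell_sd) for _ in range(n_cells)]
        cell_x.extend(xs)
        cell_y.extend(ys)
        pop_x.append(sum(xs) / n_cells)  # aggregated (averaged) measurement
        pop_y.append(sum(ys) / n_cells)
    return pearson(cell_x, cell_y), pearson(pop_x, pop_y)


single_cell_r, aggregated_r = simulate()
```

Under these assumptions the per-cell noise averages out in the aggregated measurements, so the correlation of population means is much larger than the correlation computed over pooled single cells. Other choices of within-population correlation and variance can reverse the ordering, in line with the abstract's statement that the aggregated correlation may be stronger, weaker, or equal.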

Citations: 0
Use of SVM-based ensemble feature selection method for gene expression data analysis.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-07-14; DOI: 10.1515/sagmb-2022-0002
Shizhi Zhang, Mingjin Zhang

Gene selection is one of the key steps in gene expression data analysis. An SVM-based ensemble feature selection method is proposed in this paper. Firstly, the method builds many subsets by Monte Carlo sampling. Secondly, it ranks all the features on each of the subsets and integrates the rankings to obtain a final ranking list. Finally, the optimum feature set is determined by a backward feature elimination strategy. The method is applied to the analysis of 4 public datasets, Leukemia, Prostate, Colorectal, and SMK_CAN, yielding 7, 10, 13, and 32 features, respectively. The AUCs obtained on independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and that the method is thus effective for gene selection from gene expression data.
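
The first two steps (Monte Carlo subsets, then per-subset ranking and integration) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear SVM is trained with a Pegasos-style stochastic subgradient method, features are ranked by absolute weight, rankings are integrated by summing ranks, and the backward elimination step is omitted. The toy data and all function names are hypothetical.

```python
import random


def train_linear_svm(X, y, lam=0.1, n_steps=2000, rng=None):
    """Pegasos-style stochastic subgradient descent for a linear SVM (no bias)."""
    if rng is None:
        rng = random.Random(0)
    w = [0.0] * len(X[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(X))
        eta = 1.0 / (lam * t)  # decaying step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        shrink = 1.0 - eta * lam  # effect of the L2 penalty
        if margin < 1.0:  # hinge loss is active for this sample
            w = [shrink * wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        else:
            w = [shrink * wj for wj in w]
    return w


def ensemble_ranking(X, y, n_subsets=30, frac=0.7, seed=0):
    """Rank features on random sample subsets and integrate by rank sum."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    rank_sums = [0] * p
    for _ in range(n_subsets):
        idx = rng.sample(range(n), int(frac * n))  # Monte Carlo subset
        w = train_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng=rng)
        order = sorted(range(p), key=lambda j: -abs(w[j]))
        for rank, j in enumerate(order):
            rank_sums[j] += rank
    # Smaller rank sum means the feature was consistently ranked higher.
    return sorted(range(p), key=lambda j: rank_sums[j])


def make_toy_data(n=60, p=8, seed=42):
    """Hypothetical data: only feature 0 carries class information."""
    rng = random.Random(seed)
    y = [1 if i % 2 == 0 else -1 for i in range(n)]
    X = [[3.0 * y[i] + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(p - 1)]
         for i in range(n)]
    return X, y


X, y = make_toy_data()
ranking = ensemble_ranking(X, y)
```

Summing ranks is only one simple way to integrate the per-subset rankings; the abstract does not say which integration rule the paper uses.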

Citations: 0
A robust association test with multiple genetic variants and covariates.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-06-06; DOI: 10.1515/sagmb-2021-0029
Jen-Yu Lee, Pao-Sheng Shen, Kuang-Fu Cheng

Owing to advances in genome sequencing techniques, great strides have been made in exome sequencing, so that association studies between disease and genetic variants have become feasible. Some powerful and well-known association tests have been proposed to test the association between a group of genes and the disease of interest. However, some challenges remain; in particular, many factors can affect testing power, e.g., the sample size, the number of causal and non-causal variants, and the direction of the effect of causal variants. Recently, a powerful test, called $T_{REM}$, was derived based on a random effects model. $T_{REM}$ has the advantage of being less sensitive to the inclusion of non-causal rare variants or low-effect common variants, or to the presence of missing genotypes. However, the testing power of $T_{REM}$ can be low when a portion of the causal variants have effects in opposite directions. To address this drawback of $T_{REM}$, we propose a novel test, called $T_{ROB}$, which keeps the advantages of $T_{REM}$ and is more robust than $T_{REM}$ in the sense of having adequate power when variants have opposite directions of effect. Simulation results show that $T_{ROB}$ has a stable type I error rate and outperforms $T_{REM}$ when the proportion of risk variants decreases to a certain level, and that its advantage over $T_{REM}$ increases as the proportion decreases. Furthermore, $T_{ROB}$ outperforms several other competing tests in most scenarios. The proposed methodology is illustrated using the Shanghai Breast Cancer Study.

Citations: 0
Challenges for machine learning in RNA-protein interaction prediction.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-05-02; DOI: 10.1515/sagmb-2021-0087
Viplove Arora, Guido Sanguinetti

RNA-protein interactions have long been recognised as crucial regulators of gene expression. Recently, the development of scalable experimental techniques to measure these interactions has revolutionised the field, leading to the production of large-scale datasets which offer both opportunities and challenges for machine learning techniques. In this brief note, we will discuss some of the major stumbling blocks towards the use of machine learning in computational RNA biology, focusing specifically on the problem of predicting RNA-protein interactions from next-generation sequencing data.

Citations: 1
Estimation of the covariance structure from SNP allele frequencies
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-01-01; DOI: 10.1515/sagmb-2022-0005
J. van Waaij, Zilong Li, C. Wiuf
We propose two new statistics, $\hat{V}$ and $\hat{S}$, to disentangle the population history of related populations from SNP frequency data. If the populations are related by a tree, we show by theoretical means as well as by simulation that the new statistics are able to identify the root of a tree correctly, in contrast to standard statistics, such as the observed matrix of $F_2$-statistics (distances between pairs of populations). The statistic $\hat{V}$ is obtained by averaging over all SNPs (similar to standard statistics). Its expectation is the true covariance matrix of the observed population SNP frequencies, offset by a matrix with identical entries. In contrast, the statistic $\hat{S}$ is put in a Bayesian context and is obtained by averaging over pairs of SNPs, such that each SNP is only used once. It thus makes use of the joint distribution of pairs of SNPs. In addition, we provide a number of novel mathematical results about old and new statistics, and their mutual relationship.
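
For readers unfamiliar with the standard statistic mentioned above, the observed $F_2$ matrix can be computed as in the following sketch. It shows only the naive estimator (mean squared allele-frequency difference over SNPs, without finite-sample bias correction) and does not implement the paper's new statistics. The function name `f2_matrix` and the example frequencies are made up for illustration.

```python
def f2_matrix(freqs):
    """Naive F2 estimates between all pairs of populations.

    freqs[k][s] is the allele frequency of SNP s in population k.
    Entry (a, b) is the mean over SNPs of the squared frequency
    difference between populations a and b. No finite-sample bias
    correction is applied.
    """
    n_pops, n_snps = len(freqs), len(freqs[0])
    return [
        [
            sum((freqs[a][s] - freqs[b][s]) ** 2 for s in range(n_snps)) / n_snps
            for b in range(n_pops)
        ]
        for a in range(n_pops)
    ]


# Hypothetical allele frequencies: 3 populations, 3 SNPs.
freqs = [
    [0.10, 0.50, 0.90],
    [0.20, 0.50, 0.70],
    [0.60, 0.10, 0.30],
]
f2 = f2_matrix(freqs)
```

Because every entry depends only on differences between populations, the matrix is unchanged when the same per-SNP offset is added to all populations, which is one intuitive reason why pairwise distances alone do not locate the root of a tree.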
Citations: 1
GMEPS: a fast and efficient likelihood approach for genome-wide mediation analysis under extreme phenotype sequencing
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2022-01-01; DOI: 10.1515/sagmb-2021-0071
J. Liyanage, J. Estepp, K. Srivastava, Yun Li, Motomi Mori, G. Kang
Owing to advantages such as higher statistical power for detecting associations of genetic variants with human disorders and lower cost, extreme phenotype sequencing (EPS) is a rapidly emerging study design in epidemiological and clinical studies investigating how genetic variations associate with complex phenotypes. However, the investigation of the mediation effect of genetic variants on phenotypes is severely restricted under the EPS design because existing methods cannot adequately accommodate the non-random extreme-tails sampling process that the EPS design entails. In this paper, we propose a likelihood approach for testing the mediation effect of genetic variants through continuous and binary mediators on a continuous phenotype under the EPS design (GMEPS). Besides its use under the EPS design, it can also serve as a general mediation analysis procedure. Extensive simulations and two real data applications, a genome-wide association study of benign ethnic neutropenia under the EPS design and a candidate-gene study of neurocognitive performance in patients with sickle cell disease under a random sampling design, demonstrate the superiority of GMEPS under the EPS design over widely used mediation analysis procedures, while demonstrating comparable capabilities under the general random sampling framework.
Citations: 2
Inference of genetic regulatory networks with regulatory hubs using vector autoregressions and automatic relevance determination with model selections.
IF 0.9; Zone 4 (Mathematics); Q3 Mathematics; Pub Date: 2021-12-28; DOI: 10.1515/sagmb-2020-0054
Chi-Kan Chen

The inference of genetic regulatory networks (GRNs) reveals how genes interact with each other. A few genes can regulate many target genes to control cell functions. We present new methods based on the order-1 vector autoregression (VAR1) for inferring GRNs from gene expression time series. The methods use automatic relevance determination (ARD) to incorporate the regulatory hub structure into the estimation of VAR1 in a Bayesian framework. Several sparse approximation schemes are applied to the estimated regression weights or the VAR1 model to generate the sparse weighted adjacency matrices representing the inferred GRNs. We apply the proposed methods and several widely used reference methods to infer GRNs with up to 100 genes using simulated, DREAM4 in silico, and experimental E. coli gene expression time series. We show that the proposed methods are efficient on simulated hub GRNs and scale-free GRNs using short time series simulated by VAR1s, and that they outperform reference methods on small-scale DREAM4 in silico GRNs and E. coli GRNs. They can utilize the known major regulatory hubs to improve performance on larger DREAM4 in silico GRNs and E. coli GRNs. The impact of nonlinear time series data on the performance of the proposed methods is discussed.
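
For orientation, an order-1 vector autoregression has the standard form shown in the first line below, where $\mathbf{x}_t$ is the vector of expression levels of the $p$ genes at time $t$ and $a_{ij}$ is the weight of regulator $j$ on target $i$. The second line is one common way to write an ARD prior that shares a precision $\alpha_j$ across all outgoing weights of gene $j$. Both the isotropic noise covariance and this particular grouping of the prior are assumptions for illustration, since the abstract does not state them.

```latex
\begin{align}
  \mathbf{x}_t &= A\,\mathbf{x}_{t-1} + \boldsymbol{\varepsilon}_t,
    \qquad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0}, \sigma^2 I), \\
  a_{ij} &\sim \mathcal{N}\!\left(0, \alpha_j^{-1}\right),
    \qquad i, j = 1, \dots, p .
\end{align}
```

Under such a prior, a large estimated $\alpha_j$ shrinks every outgoing weight of gene $j$ toward zero, so only the few genes with small $\alpha_j$ retain many non-zero targets and behave as hubs.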

Citations: 1
Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE).
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2021-12-14, DOI: 10.1515/sagmb-2021-0020
Qing Xia, Jeffrey A Thompson, Devin C Koestler

Batch-effects present challenges in the analysis of high-throughput molecular data and are particularly problematic in longitudinal studies when interest lies in identifying genes/features whose expression changes over time, but time is confounded with batch. While many methods to correct for batch-effects exist, most assume independence across samples, an assumption that is unlikely to hold in longitudinal microarray studies. We propose Batch effect Reduction of mIcroarray data with Dependent samples usinG Empirical Bayes (BRIDGE), a three-step parametric empirical Bayes approach that leverages technical replicate samples profiled at multiple timepoints/batches, so-called "bridge samples", to inform batch-effect reduction/attenuation in longitudinal microarray studies. Extensive simulation studies and an analysis of a real biological data set were conducted to benchmark the performance of BRIDGE against both ComBat and longitudinal ComBat. Our results demonstrate that while all methods perform well in facilitating accurate estimates of time effects, BRIDGE outperforms both ComBat and longitudinal ComBat in the removal of batch-effects in data sets with bridging samples and, perhaps as a result, was observed to have improved statistical power for detecting genes with a time effect. BRIDGE demonstrated competitive performance in batch effect reduction of confounded longitudinal microarray studies, in both simulated and real data sets, and may serve as a useful preprocessing method for researchers conducting longitudinal microarray studies that include bridging samples.
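
The role of bridge samples can be illustrated with a deliberately naive sketch. It is not the three-step empirical Bayes procedure of BRIDGE; it only shows why technical replicates run in two batches carry information about the batch effect when time and batch are otherwise confounded. The function name `bridge_shift_correct`, the genes-by-samples layout, and the purely additive per-gene shift model are assumptions made for illustration.

```python
import numpy as np


def bridge_shift_correct(batch1, batch2, bridge_idx1, bridge_idx2):
    """Remove a per-gene additive batch shift estimated from bridge samples.

    batch1, batch2 : arrays of shape (n_genes, n_samples) for each batch.
    bridge_idx1[k] and bridge_idx2[k] are the column indices of the same
    biological sample measured in batch 1 and re-run in batch 2.
    """
    # A bridge sample is the same material in both batches, so its
    # between-batch difference reflects batch effect rather than time effect.
    diff = batch2[:, bridge_idx2] - batch1[:, bridge_idx1]
    shift = diff.mean(axis=1, keepdims=True)  # one shift per gene
    return batch2 - shift, shift.ravel()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    batch1 = rng.normal(size=(4, 6))             # 4 genes, 6 samples at time 1
    true_shift = np.array([1.0, -0.5, 0.0, 2.0])  # hypothetical batch effect
    new_samples = rng.normal(size=(4, 3))         # time-2 samples
    # Batch 2 holds re-runs of batch-1 samples 0 and 2, plus the new samples.
    batch2 = np.hstack([batch1[:, [0, 2]], new_samples]) + true_shift[:, None]
    corrected, shift = bridge_shift_correct(batch1, batch2, [0, 2], [0, 1])
    print(shift)
```

With only a handful of bridge samples, a per-gene mean like this is noisy. Sharing information across genes is the usual motivation for an empirical Bayes treatment.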

Citations: 0
Frontmatter
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2021-12-01, DOI: 10.1515/sagmb-2021-frontmatter4-6
Citations: 0