
Statistical Applications in Genetics and Molecular Biology: Latest Articles

Testing genotypes-phenotype relationships using permutation tests on association rules.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2015-02-01; Vol. 14(1), pp. 83-92; DOI: 10.1515/sagmb-2014-0033
Mateen Shaikh, Joseph Beyene

Association rule mining is a knowledge discovery technique that informs researchers about relationships between variables in data. These relationships can be focused on a specific set of response variables. We propose an augmented version of this method to discover groups of genotypes that relate to specific outcomes. We derive the methodology for finding these candidate groups of genotypes and illustrate the method both on data regarding neuroinvasive complications of West Nile virus and through simulation.
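As a rough illustration of the general idea, and not the authors' procedure, the sketch below scores one hypothetical rule "genotype group => outcome" by its confidence and estimates a permutation p-value by shuffling the outcome labels. The toy data, the function names, and the choice of confidence as the test statistic are all assumptions made for this example.

```python
import random

def rule_confidence(antecedent, outcome):
    """Confidence of the rule 'antecedent => outcome', i.e. P(outcome | antecedent)."""
    covered = [o for a, o in zip(antecedent, outcome) if a]
    return sum(covered) / len(covered) if covered else 0.0

def permutation_p_value(antecedent, outcome, n_perm=2000, seed=0):
    """One-sided permutation p-value for the observed rule confidence."""
    rng = random.Random(seed)
    observed = rule_confidence(antecedent, outcome)
    shuffled = list(outcome)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-outcome link
        if rule_confidence(antecedent, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 10 subjects carry the genotype group, 30 do not.
antecedent = [1] * 10 + [0] * 30
outcome = [1] * 9 + [0] * 1 + [1] * 6 + [0] * 24
confidence, p_value = permutation_p_value(antecedent, outcome)
```

Shuffling the outcome while keeping the genotype indicators fixed preserves both margins, so the reference distribution reflects only chance co-occurrence.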

Citations: 2
A Bayesian mixture model for chromatin interaction data.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2015-02-01; Vol. 14(1), pp. 53-64; DOI: 10.1515/sagmb-2014-0029
Liang Niu, Shili Lin

Chromatin interactions mediated by a particular protein are of interest for studying gene regulation, especially the regulation of genes that are associated with, or known to be causative of, a disease. A recent molecular technique, Chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), that uses chromatin immunoprecipitation (ChIP) and high throughput paired-end sequencing, is able to detect such chromatin interactions genomewide. However, ChIA-PET may generate noise (i.e., pairings of DNA fragments by random chance) in addition to true signal (i.e., pairings of DNA fragments by interactions). In this paper, we propose MC_DIST based on a mixture modeling framework to identify true chromatin interactions from ChIA-PET count data (counts of DNA fragment pairs). The model is cast into a Bayesian framework to take into account the dependency among the data and the available information on protein binding sites and gene promoters to reduce false positives. A simulation study showed that MC_DIST outperforms the previously proposed hypergeometric model in terms of both power and type I error rate. A real data study showed that MC_DIST may identify potential chromatin interactions between protein binding sites and gene promoters that may be missed by the hypergeometric model. An R package implementing the MC_DIST model is available at http://www.stat.osu.edu/~statgen/SOFTWARE/MDM.
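To make the noise-versus-signal mixture idea concrete, here is a minimal sketch that fits a two-component Poisson mixture to toy pair counts with EM. It is a deliberately simplified stand-in: MC_DIST is Bayesian and also models dependency and binding-site information, none of which is attempted here. The Poisson components, the starting values, and the toy counts are assumptions.

```python
import math

def poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def fit_two_poisson_mixture(counts, n_iter=200):
    """EM for a two-component Poisson mixture (low-rate noise, high-rate signal)."""
    lam_noise, lam_signal, pi_signal = 1.0, float(max(max(counts), 2)), 0.5
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each count belongs to the signal component.
        resp = []
        for k in counts:
            a = math.log(pi_signal) + poisson_logpmf(k, lam_signal)
            b = math.log(1 - pi_signal) + poisson_logpmf(k, lam_noise)
            m = max(a, b)
            resp.append(math.exp(a - m) / (math.exp(a - m) + math.exp(b - m)))
        # M-step: weighted means, clamped away from degenerate values.
        total = sum(resp)
        pi_signal = min(max(total / len(counts), 1e-6), 1 - 1e-6)
        lam_signal = max(sum(r * k for r, k in zip(resp, counts)) / max(total, 1e-12), 1e-6)
        lam_noise = max(sum((1 - r) * k for r, k in zip(resp, counts))
                        / max(len(counts) - total, 1e-12), 1e-6)
    return lam_noise, lam_signal, pi_signal, resp

# Hypothetical counts of DNA fragment pairs: mostly low, a few clearly elevated.
counts = [1, 2, 1, 0, 2, 1, 3, 1, 2, 1, 12, 15, 11, 14]
lam_noise, lam_signal, pi_signal, resp = fit_two_poisson_mixture(counts)
```

The posterior probabilities in `resp` play the role of per-pair interaction calls; a Bayesian treatment would additionally put priors on the rates and the mixing weight.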

Citations: 6
A region-based multiple testing method for hypotheses ordered in space or time.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2015-02-01; Vol. 14(1), pp. 1-19; DOI: 10.1515/sagmb-2013-0075
Rosa J Meijer, Thijmen J P Krebs, Jelle J Goeman

We present a multiple testing method for hypotheses that are ordered in space or time. Given such hypotheses, the elementary hypotheses as well as regions of consecutive hypotheses are of interest. These region hypotheses not only have intrinsic meaning, but testing them also has the advantage that (potentially small) signals across a region are combined in one test. Because the expected number and length of potentially interesting regions are usually not available beforehand, we propose a method that tests all possible region hypotheses as well as all individual hypotheses in a single multiple testing procedure that controls the familywise error rate. We start by testing the global null hypothesis and, when this hypothesis can be rejected, continue by further specifying the exact location or locations of the effect. The method is implemented in the R package cherry and is illustrated on a DNA copy number data set.

Citations: 12
A hidden Markov-model for gene mapping based on whole-genome next generation sequencing data.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2015-02-01; Vol. 14(1), pp. 21-34; DOI: 10.1515/sagmb-2014-0007
Jürgen Claesen, Tomasz Burzykowski

The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms as genetic markers. Combining the technologies with pooling of segregants, as performed in bulk segregant analysis, should, in principle, allow the simultaneous mapping of multiple genetic loci present throughout the genome. We propose a hidden Markov-model to analyze the marker data obtained by the bulk segregant next generation sequencing. The model includes several states, each associated with a different probability of observing the same/different nucleotide in an offspring as compared to the parent. The transitions between the molecular markers imply transitions between the states of the model. After estimating the transition probabilities and state-related probabilities of nucleotide (dis)similarity, the most probable state for each SNP is selected. The most probable states can then be used to indicate which genomic regions may be likely to contain trait-related genes. The application of the model is illustrated on the data from a study of ethanol tolerance in yeast. Software is written in R. R-functions, R-scripts and documentation are available on www.ibiostat.be/software/bioinformatics.
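The step of selecting the most probable state for each SNP can be sketched with a small Viterbi decoder. The example below is an assumption-laden toy, not the authors' R software: it uses two hypothetical states ("unlinked", "linked"), binomial emissions for the number of reads matching the parent nucleotide, and fixed rather than estimated probabilities.

```python
import math

def binom_logpmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path for a sequence of observations."""
    best = [{s: log_start[s] + log_emit(s, obs[0]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit(s, o)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

states = ["unlinked", "linked"]
match_prob = {"unlinked": 0.5, "linked": 0.95}  # assumed P(read matches parent)
log_start = {s: math.log(0.5) for s in states}
log_trans = {r: {s: math.log(0.9 if r == s else 0.1) for s in states} for r in states}

def log_emit(state, o):
    matches, depth = o
    return binom_logpmf(matches, depth, match_prob[state])

# Hypothetical (matching reads, read depth) per SNP along a chromosome.
snps = [(10, 20), (11, 20), (19, 20), (20, 20), (18, 20), (9, 20)]
path = viterbi(snps, states, log_start, log_trans, log_emit)
```

Runs of "linked" calls in `path` mark candidate regions; in the paper the transition and emission probabilities are estimated from the data rather than fixed as they are here.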

Citations: 9
Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2014-12-01; Vol. 13(6), pp. 717-732; DOI: 10.1515/sagmb-2014-0003
Victor L Jong, Putri W Novianti, Kit C B Roes, Marinus J C Eijkemans

The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call filtering and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using Box's M statistic on permuted samples. We found that correlation structures differ significantly between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
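For readers unfamiliar with Box's M, the sketch below computes it for two groups of bivariate toy data and obtains a permutation p-value by reshuffling samples between groups. It is restricted to two variables so that determinants can be written out by hand, and it works on covariance matrices; the study's exact permutation scheme, its use of correlation matrices, and its high-dimensional setting are not reproduced. All data and names are hypothetical.

```python
import math
import random

def cov2(rows):
    """Unbiased covariance of bivariate rows, returned as (sxx, sxy, syy)."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return sxx, sxy, syy

def det2(c):
    return c[0] * c[2] - c[1] ** 2

def box_m(groups):
    """Box's M = (N - g) ln|S_pooled| - sum_i (n_i - 1) ln|S_i|."""
    N, g = sum(len(rows) for rows in groups), len(groups)
    covs = [cov2(rows) for rows in groups]
    pooled = [sum((len(rows) - 1) * c[k] for rows, c in zip(groups, covs)) / (N - g)
              for k in range(3)]
    return (N - g) * math.log(det2(pooled)) - sum(
        (len(rows) - 1) * math.log(det2(c)) for rows, c in zip(groups, covs))

def box_m_permutation_test(groups, n_perm=500, seed=1):
    rng = random.Random(seed)
    observed = box_m(groups)
    pooled_rows = [row for rows in groups for row in rows]
    sizes = [len(rows) for rows in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled_rows)  # reassign samples to groups at random
        start, shuffled = 0, []
        for size in sizes:
            shuffled.append(pooled_rows[start:start + size])
            start += size
        if box_m(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: group A is strongly correlated, group B is uncorrelated.
rng = random.Random(0)
group_a = []
for _ in range(30):
    x = rng.gauss(0, 1)
    group_a.append((x, x + 0.1 * rng.gauss(0, 1)))
group_b = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
m_stat, p_value = box_m_permutation_test([group_a, group_b])
```

Using permutations avoids relying on the chi-square approximation to M, which is known to be sensitive to non-normality.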

Citations: 6
Covariate adjusted differential variability analysis of DNA methylation with propensity score method.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2014-12-01; Vol. 13(6), pp. 645-658; DOI: 10.1515/sagmb-2013-0072
Pei Fen Kuan

It has been proposed recently that differentially variable CpG methylation (DVC) may contribute to transcriptional aberrations in human diseases. In large scale epigenetic studies, potential confounders could affect the observed methylation variabilities and need to be accounted for. In this paper, we develop a robust statistical model for differential variability (DVC) analysis that accounts for potential confounding covariates by utilizing the propensity score method. Our method is based on a weighted score test on strata generated by propensity score stratification. To the best of our knowledge, this is the first proposed statistical method for detecting DVCs that adjusts for confounding covariates. We show that this method is robust against model misspecification and achieves good operating characteristics based on extensive simulations and a case study.

Citations: 1
P-value calibration for multiple testing problems in genomics.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2014-12-01; Vol. 13(6), pp. 659-673; DOI: 10.1515/sagmb-2013-0074
John P Ferguson, Dean Palejev

Conservative statistical tests are often used in complex multiple testing settings in which computing the type I error may be difficult. In such tests, the reported p-value for a hypothesis can understate the evidence against the null hypothesis and consequently statistical power may be lost. False Discovery Rate adjustments, used in multiple comparison settings, can worsen the unfavorable effect. We present a computationally efficient and test-agnostic calibration technique that can substantially reduce the conservativeness of such tests. As a consequence, a lower sample size might be sufficient to reject the null hypothesis for true alternatives, and experimental costs can be lowered. We apply the calibration technique to the results of DESeq, a popular method for detecting differentially expressed genes from RNA sequencing data. The increase in power may be particularly high in small sample size experiments, often used in preliminary experiments and funding applications.
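The remark that FDR adjustment can compound conservativeness is easy to see in the Benjamini-Hochberg step-up adjustment, which scales each ordered p-value by m/rank, so any upward bias in the raw p-values carries straight through. The sketch below implements that adjustment only. The abstract does not say which FDR adjustment is meant, so the choice of Benjamini-Hochberg is an assumption, and this is not the paper's calibration technique.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
adjusted = benjamini_hochberg(raw)
```

With these four values the smallest adjusted p-value is 0.04, four times the smallest raw p-value, which illustrates how quickly a conservative test can lose discoveries after adjustment.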

Citations: 1
When is Menzerath-Altmann law mathematically trivial? A new approach.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2014-12-01; Vol. 13(6), pp. 633-644; DOI: 10.1515/sagmb-2013-0034
Ramon Ferrer-i-Cancho, Antoni Hernández-Fernández, Jaume Baixeries, Łukasz Dębowski, Ján Mačutek

Menzerath's law, the tendency of Z (the mean size of the parts) to decrease as X (the number of parts) increases, is found in language, music and genomes. Recently, it has been argued that the presence of the law in genomes is an inevitable consequence of the fact that Z=Y/X, which would imply that Z scales with X as Z ∼ 1/X. That scaling is a very particular case of the Menzerath-Altmann law that has been rejected by means of a correlation test between X and Y in genomes, where X is the number of chromosomes of a species, Y its genome size in bases and Z the mean chromosome size. Here we review the statistical foundations of that test and consider three non-parametric tests based upon different correlation metrics and one parametric test to evaluate if Z ∼ 1/X in genomes. The most powerful test is a new non-parametric one based upon the correlation ratio, which is able to reject Z ∼ 1/X in nine out of 11 taxonomic groups and detect a borderline group. Rather than a fact, Z ∼ 1/X is a baseline that real genomes do not meet. The view of the Menzerath-Altmann law as inevitable is seriously flawed.
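One way to see why a test of association between X and Y bears on Z ∼ 1/X is the short derivation below. It reads Z ∼ 1/X as a statement about the conditional mean of Z given X, with c an unspecified constant; that reading is an assumption made here for illustration and may be narrower than the paper's formal definition.

```latex
\begin{align*}
Z &= \frac{Y}{X}, \qquad \mathbb{E}[Z \mid X = x] = \frac{c}{x} \\
\Rightarrow\quad \mathbb{E}[Y \mid X = x] &= x\,\mathbb{E}[Z \mid X = x] = c \\
\Rightarrow\quad \operatorname{Cov}(X, Y) &= \mathbb{E}\bigl[X\,\mathbb{E}[Y \mid X]\bigr] - \mathbb{E}[X]\,\mathbb{E}[Y]
  = c\,\mathbb{E}[X] - c\,\mathbb{E}[X] = 0.
\end{align*}
```

Under this reading, Y is mean-independent of X, so a significant association between the number of chromosomes and genome size counts as evidence against Z ∼ 1/X.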

Citations: 22
Markovianness and conditional independence in annotated bacterial DNA.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2014-12-01; Vol. 13(6), pp. 693-716; DOI: 10.1515/sagmb-2014-0002
Andrew Hart, Servet Martínez

We explore the probabilistic structure of DNA in a number of bacterial genomes and conclude that a form of Markovianness is present at the boundaries between coding and non-coding regions, that is, the sequence of START and STOP codons annotated for the bacterial genome. This sequence is shown to satisfy a conditional independence property which allows its governing Markov chain to be uniquely identified from the abundances of START and STOP codons. Furthermore, we show that the annotated sequence of STARTs and STOPs complies with Chargaff's second parity rule.

Citations: 5
Robust methods to detect disease-genotype association in genetic association studies: calculate p-values using exact conditional enumeration instead of simulated permutations or asymptotic approximations.
IF 0.9; Zone 4 (Mathematics); Q4 (Biochemistry & Molecular Biology); Pub Date: 2014-12-01; DOI: 10.1515/sagmb-2013-0084
Mette Langaas, Øyvind Bakke

In genetic association studies, detecting disease-genotype association is a primary goal. We study seven robust test statistics for such association when the underlying genetic model is unknown, for data on disease status (case or control) and genotype (three genotypes of a biallelic genetic marker). In such studies, p-values have predominantly been calculated by asymptotic approximations or by simulated permutations. We consider an exact method, conditional enumeration. When the number of simulated permutations tends to infinity, the permutation p-value approaches the conditional enumeration p-value, but calculating the latter is much more efficient than performing simulated permutations. We have studied case-control sample sizes with 500-5000 cases and 500-15,000 controls, and significance levels from 5 × 10⁻⁸ to 0.05, thus our results are applicable to genetic association studies with only a few genetic markers under study, intermediate follow-up studies, and genome-wide association studies. Our main findings are: (i) If all monotone genetic models are of interest, the best performance in the situations under study is achieved for the robust test statistics based on the maximum over a range of Cochran-Armitage trend tests with different scores and for the constrained likelihood ratio test. (ii) For significance levels below 0.05, for the test statistics under study, asymptotic approximations may give a test size up to 20 times the nominal level, and should therefore be used with caution. (iii) Calculating p-values based on exact conditional enumeration is a powerful, valid and computationally feasible approach, and we advocate its use in genetic association studies.
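
The relationship between the two p-value calculations described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and covers only one statistic: a single Cochran-Armitage trend score with additive scores (0, 0.5, 1), which is an assumption made here for brevity, not one of the paper's seven robust statistics. The function names `trend_stat`, `exact_pvalue`, and `permutation_pvalue` are hypothetical. Conditional on the numbers of cases and of each genotype, the case genotype counts follow a multivariate hypergeometric distribution under the null hypothesis, so the exact p-value is a finite sum over all tables with those margins.

```python
import random
from math import comb

SCORES = (0.0, 0.5, 1.0)  # additive genetic model (an assumption of this sketch)

def trend_stat(x, n, scores=SCORES):
    """Squared deviation of the case trend score from its null expectation.

    x: case counts per genotype; n: genotype totals (cases + controls).
    With margins fixed, the null variance is a constant, so this orders
    tables exactly as the standardized trend statistic would.
    """
    total, cases = sum(n), sum(x)
    observed = sum(s * xi for s, xi in zip(scores, x))
    expected = cases * sum(s * ni for s, ni in zip(scores, n)) / total
    return (observed - expected) ** 2

def exact_pvalue(x_obs, n, scores=SCORES):
    """Exact conditional enumeration over all 2x3 tables with the observed margins."""
    cases, total = sum(x_obs), sum(n)
    t_obs = trend_stat(x_obs, n, scores)
    denom = comb(total, cases)
    p = 0.0
    for x0 in range(max(0, cases - n[1] - n[2]), min(n[0], cases) + 1):
        for x1 in range(max(0, cases - x0 - n[2]), min(n[1], cases - x0) + 1):
            x2 = cases - x0 - x1
            # Multivariate hypergeometric probability of this table under the null.
            prob = comb(n[0], x0) * comb(n[1], x1) * comb(n[2], x2) / denom
            if trend_stat((x0, x1, x2), n, scores) >= t_obs - 1e-12:
                p += prob
    return p

def permutation_pvalue(x_obs, n, n_perm=20000, seed=0, scores=SCORES):
    """Monte Carlo estimate: redraw which individuals are cases at random."""
    rng = random.Random(seed)
    genotypes = [g for g, ni in enumerate(n) for _ in range(ni)]
    cases = sum(x_obs)
    t_obs = trend_stat(x_obs, n, scores)
    hits = 0
    for _ in range(n_perm):
        x = [0, 0, 0]
        for g in rng.sample(genotypes, cases):
            x[g] += 1
        if trend_stat(x, n, scores) >= t_obs - 1e-12:
            hits += 1
    return hits / n_perm

if __name__ == "__main__":
    n = (10, 8, 6)       # hypothetical genotype totals
    x_obs = (2, 4, 5)    # hypothetical case counts per genotype
    print("exact:      ", exact_pvalue(x_obs, n))
    print("permutation:", permutation_pvalue(x_obs, n))
```

The enumeration visits each admissible table once, whereas the permutation estimate carries Monte Carlo error that shrinks only with the square root of the number of permutations. That error matters most at very small significance levels such as those used in genome-wide studies.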

Citations: 6