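Accurate and fast small p-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method.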

IF 0.9 | Zone 4 (Mathematics) | JCR Q3 (Mathematics) | Statistical Applications in Genetics and Molecular Biology | Pub Date: 2023-01-01 | DOI: 10.1515/sagmb-2021-0067
Yang Shi, Weiping Shi, Mengqiao Wang, Ji-Hyun Lee, Huining Kang, Hui Jiang
Citations: 1

Abstract

Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in applying permutation tests in genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small p-values, which leads to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework that parameterizes the permutation sample spaces of these two types of data using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. We demonstrate the performance of the proposed algorithms on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, and compare them with existing methods such as crude permutations and SAMC. The results show that our approaches can achieve orders-of-magnitude gains in computational efficiency when estimating small p-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
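
The paired-data idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm or code: it assumes a simple test statistic (the sum of paired differences), models each sign flip as an independent Bernoulli variable, and uses a textbook cross-entropy update with importance sampling. The function names (`exact_p`, `ce_p_value`), the elite fraction `rho`, and the clipping bounds 0.05 and 0.95 are all hypothetical choices made for illustration. The independent two-group case with the conditional Bernoulli distribution is not covered here.

```python
import itertools
import random


def stat(diffs, keep):
    """Sum of paired differences; keep[i] = 1 keeps the sign of diffs[i], 0 flips it."""
    return sum(d if k else -d for d, k in zip(diffs, keep))


def exact_p(diffs):
    """One-sided p-value by enumerating all 2^n sign assignments (small n only)."""
    t_obs = sum(diffs)
    hits = sum(stat(diffs, keep) >= t_obs
               for keep in itertools.product((0, 1), repeat=len(diffs)))
    return hits / 2 ** len(diffs)


def draw(q, rng):
    """Draw one sign assignment with keep[i] ~ Bernoulli(q[i])."""
    return [1 if rng.random() < qi else 0 for qi in q]


def weight(keep, q):
    """Likelihood ratio of the null (every q[i] = 0.5) to the proposal Bernoulli(q)."""
    w = 1.0
    for k, qi in zip(keep, q):
        w *= 0.5 / (qi if k else 1.0 - qi)
    return w


def ce_p_value(diffs, n_samples=2000, rho=0.1, max_iter=50, seed=0):
    """Importance-sampling estimate of the p-value with a cross-entropy-tuned proposal."""
    rng = random.Random(seed)
    t_obs = sum(diffs)
    q = [0.5] * len(diffs)  # start from the null distribution
    for _ in range(max_iter):
        draws = [draw(q, rng) for _ in range(n_samples)]
        stats = [stat(diffs, k) for k in draws]
        # Intermediate level: the (1 - rho) sample quantile, capped at t_obs.
        gamma = min(t_obs, sorted(stats)[int((1 - rho) * n_samples)])
        elite = [(weight(k, q), k) for k, s in zip(draws, stats) if s >= gamma]
        total = sum(w for w, _ in elite)
        # Weighted mean of each indicator over the elite draws, clipped (an
        # assumption of this sketch) so that no proposal probability hits 0 or 1.
        q = [min(0.95, max(0.05, sum(w * k[i] for w, k in elite) / total))
             for i in range(len(diffs))]
        if gamma >= t_obs:
            break
    # Final estimate: average of weight * indicator under the tuned proposal.
    draws = [draw(q, rng) for _ in range(n_samples)]
    return sum(weight(k, q) for k in draws if stat(diffs, k) >= t_obs) / n_samples


if __name__ == "__main__":
    # Made-up paired differences, chosen only to give a small p-value.
    diffs = [5, 3, 4, -1, 6, 2, 7, 3, -2, 4, 5, 6, 3, 4]
    print("exact:", exact_p(diffs))
    print("cross-entropy estimate:", ce_p_value(diffs))
```

With only 14 pairs the exact p-value can still be enumerated, which makes it easy to check the estimate. For realistic sample sizes enumeration is infeasible, and crude sampling with 2000 draws would see almost no exceedances at this p-value, which is the situation the tuned proposal is meant to address.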

Source journal
CiteScore: 1.20
Self-citation rate: 11.10%
Articles published: 8
Review time: 6-12 weeks
About the journal: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues but should also contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles are welcome.
Latest articles from this journal:
Empirically adjusted fixed-effects meta-analysis methods in genomic studies.
A CNN-CBAM-BIGRU model for protein function prediction.
A heavy-tailed model for analyzing miRNA-seq raw read counts.
Flexible model-based non-negative matrix factorization with application to mutational signatures.
Choice of baseline hazards in joint modeling of longitudinal and time-to-event cancer survival data.