Double truncation method for controlling local false discovery rate in case of spiky null

IF 1.4 | Zone 4, Mathematics | Q3 Statistics & Probability | Computational Statistics | Pub Date: 2024-06-05 | DOI: 10.1007/s00180-024-01510-4

Shinjune Kim, Youngjae Oh, Johan Lim, DoHwan Park, Erin M. Green, Mark L. Ramos, Jaesik Jeong



Abstract

Many multiple testing procedures that control the false discovery rate have been developed to identify cases (e.g., genes) that differ significantly between two groups. However, some practical data sets have highly spiky null distributions. Existing methods struggle to control type I error in such cases because of “inflated false positives,” and this problem has not been addressed in the previous literature. Our team recently encountered this issue while analyzing SET4 gene deletion data and proposed modeling the null distribution with a scale mixture of normal distributions. That approach, however, is of limited use because it places strong assumptions on the spiky peak. In this paper, we present a novel multiple testing procedure that can be applied to data with any type of spiky peak, including data with no spiky peak or with one or two spiky peaks. Our approach truncates the central statistics around 0, which primarily contribute to the null spike, as well as the two tails, which may be contaminated by alternative distributions. We refer to this as the “double truncation method.” After applying double truncation, we estimate the null density using the doubly truncated maximum likelihood estimator. Using simulated data, we demonstrate numerically that the proposed method controls the false discovery rate at the desired level. Furthermore, we apply our method to two real data sets, namely the SET protein data and peony data.
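The double truncation idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the non-spike part of the null is normal, and the function name `fit_doubly_truncated_normal`, the cutoffs `a` and `b`, and the simulated mixture are all hypothetical choices made for illustration. Only statistics with `a <= |z| <= b` are kept, and the normal parameters are fitted by maximizing the likelihood renormalized over that retained region.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def fit_doubly_truncated_normal(z, a, b):
    """MLE of (mu, sigma) for a normal null, using only a <= |z| <= b.

    Assumption: the cutoffs a (center) and b (tails) are supplied by the user.
    """
    z = np.asarray(z, dtype=float)
    kept = z[(np.abs(z) >= a) & (np.abs(z) <= b)]

    def neg_log_lik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # optimize log(sigma) so sigma stays positive
        # Probability that N(mu, sigma^2) falls in [-b, -a] or [a, b].
        mass = (norm.cdf(-a, mu, sigma) - norm.cdf(-b, mu, sigma)
                + norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))
        mass = max(mass, 1e-300)  # guard against log(0)
        log_lik = norm.logpdf(kept, mu, sigma).sum() - kept.size * np.log(mass)
        return -log_lik

    start = np.array([kept.mean(), np.log(kept.std())])
    result = minimize(neg_log_lik, start, method="Nelder-Mead")
    mu_hat, log_sigma_hat = result.x
    return mu_hat, np.exp(log_sigma_hat)


# Hypothetical simulated statistics: broad null, spike at 0, two-sided alternatives.
rng = np.random.default_rng(0)
z = np.concatenate([
    rng.normal(0.0, 1.0, 20000),   # broad null component, N(0, 1)
    rng.normal(0.0, 0.05, 5000),   # spike concentrated around 0
    rng.normal(5.0, 0.5, 1000),    # alternatives in the right tail
    rng.normal(-5.0, 0.5, 1000),   # alternatives in the left tail
])

mu_hat, sigma_hat = fit_doubly_truncated_normal(z, a=0.5, b=2.5)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

In this simulated setting the spike and the alternatives fall almost entirely outside the retained region, so the fit should land close to the broad null component's parameters (0 and 1). How the cutoffs are chosen in practice, and how the fitted null feeds into the local false discovery rate, is left to the paper.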


Source journal: Computational Statistics (Mathematics: Statistics & Probability)

CiteScore: 2.90

Self-citation rate: 0.00%

Articles published: 122

Review time: >12 weeks

About the journal: Computational Statistics (CompStat) is an international journal which promotes the publication of applications and methodological research in the field of Computational Statistics. The focus of papers in CompStat is on the contribution to and influence of computing on statistics and vice versa. The journal provides a forum for computer scientists, mathematicians, and statisticians in a variety of fields of statistics such as biometrics, econometrics, data analysis, graphics, simulation, algorithms, knowledge based systems, and Bayesian computing. CompStat publishes hardware, software plus package reports.