Optimal Estimation of the Null Distribution in Large-Scale Inference

Impact factor: 2.2 | Zone 3 (Computer Science) | JCR Q3 (Computer Science, Information Systems) | IEEE Transactions on Information Theory | Pub Date: 2025-01-14 | DOI: 10.1109/TIT.2025.3529457

Subhodh Kotekal; Chao Gao

IEEE Transactions on Information Theory, vol. 71, no. 3, pp. 2075-2103, 2025. https://ieeexplore.ieee.org/document/10841456/


Abstract

The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a series of highly original articles, Efron persuasively illustrated the danger for downstream inference in assuming the veracity of a posited null distribution. In a Gaussian model for $n$ many z-scores with at most $k < \frac{n}{2}$ nonnulls, Efron suggests estimating the parameters of an empirical null $N(\theta, \sigma^{2})$ instead of assuming the theoretical null $N(0, 1)$. Looking to the robust statistics literature by viewing the nonnulls as outliers is unsatisfactory as the question of optimal rates is still open; even consistency is not known in the regime $k \asymp n$, which is especially relevant to many large-scale inference applications. However, provably rate-optimal robust estimators have been developed in other models (e.g., Huber contamination) which appear quite close to Efron’s proposal. Notably, the impossibility of consistency when $k \asymp n$ in these other models may suggest the same major weakness afflicts Efron’s popularly adopted recommendation. A sound evaluation thus requires a complete understanding of information-theoretic limits. We characterize the regime of $k$ for which consistent estimation is possible, notably without imposing any assumptions at all on the nonnull effects. Unlike in other robust models, it is shown that consistent estimation of the location parameter is possible if and only if $\frac{n}{2} - k = \omega(\sqrt{n})$, and of the scale parameter in the entire regime $k < \frac{n}{2}$. Furthermore, we establish sharp minimax rates and show that estimators based on the empirical characteristic function are optimal by exploiting the Gaussian character of the data.
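To make the phrase "empirical characteristic function" concrete, here is a minimal sketch of what such a quantity looks like and how the Gaussian form of the null shows up in it. It is not the estimator analyzed in the paper: it assumes every observation is a null draw from $N(\theta, \sigma^{2})$, so it has no robustness to nonnulls. For such data the characteristic function is $\exp(i\theta t - \sigma^{2}t^{2}/2)$, so its modulus carries $\sigma^{2}$ and its phase carries $\theta$. The function names, the frequency `t=0.5`, and the simulated values are all illustrative assumptions.

```python
import numpy as np

def empirical_cf(x, t):
    """Empirical characteristic function (1/n) * sum_j exp(i * t * x_j)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.exp(1j * t * x))

def naive_ecf_null_estimates(x, t=0.5):
    """Hypothetical helper: read (theta, sigma^2) off the ECF at one frequency t.

    Assumes ALL observations are N(theta, sigma^2) draws (no nonnulls), so that
    the ECF is close to exp(i*theta*t - sigma^2 * t^2 / 2). The phase is only
    identified when |theta * t| < pi.
    """
    psi = empirical_cf(x, t)
    sigma2_hat = -2.0 * np.log(np.abs(psi)) / t**2  # modulus -> scale
    theta_hat = np.angle(psi) / t                   # phase -> location
    return theta_hat, sigma2_hat

# Simulated pure-null z-scores with an illustrative empirical null N(0.3, 1.2^2).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.2, size=200_000)
theta_hat, sigma2_hat = naive_ecf_null_estimates(x)
print(theta_hat, sigma2_hat)
```

With nonnulls present, the nonnull terms perturb both the modulus and the phase of the empirical characteristic function, which is why a single-frequency plug-in like this one should be read only as an illustration of the underlying identity.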

