{"title":"Optimal Estimation of the Null Distribution in Large-Scale Inference","authors":"Subhodh Kotekal;Chao Gao","doi":"10.1109/TIT.2025.3529457","DOIUrl":null,"url":null,"abstract":"The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a series of highly original articles, Efron persuasively illustrated the danger for downstream inference in assuming the veracity of a posited null distribution. In a Gaussian model for n many z-scores with at most <inline-formula> <tex-math>$k \\lt \\frac {n}{2}$ </tex-math></inline-formula> nonnulls, Efron suggests estimating the parameters of an empirical null <inline-formula> <tex-math>$N(\\theta , \\sigma ^{2})$ </tex-math></inline-formula> instead of assuming the theoretical null <inline-formula> <tex-math>$N(0, 1)$ </tex-math></inline-formula>. Looking to the robust statistics literature by viewing the nonnulls as outliers is unsatisfactory as the question of optimal rates is still open; even consistency is not known in the regime <inline-formula> <tex-math>$k \\asymp n$ </tex-math></inline-formula> which is especially relevant to many large-scale inference applications. However, provably rate-optimal robust estimators have been developed in other models (e.g. Huber contamination) which appear quite close to Efron’s proposal. Notably, the impossibility of consistency when <inline-formula> <tex-math>$k \\asymp n$ </tex-math></inline-formula> in these other models may suggest the same major weakness afflicts Efron’s popularly adopted recommendation. A sound evaluation thus requires a complete understanding of information-theoretic limits. We characterize the regime of k for which consistent estimation is possible, notably without imposing any assumptions at all on the nonnull effects. Unlike in other robust models, it is shown consistent estimation of the location parameter is possible if and only if <inline-formula> <tex-math>$\\frac {n}{2} {-} k = \\omega (\\sqrt {n})$ </tex-math></inline-formula>, and of the scale parameter in the entire regime <inline-formula> <tex-math>$k \\lt \\frac {n}{2}$ </tex-math></inline-formula>. Furthermore, we establish sharp minimax rates and show estimators based on the empirical characteristic function are optimal by exploiting the Gaussian character of the data.","PeriodicalId":13494,"journal":{"name":"IEEE Transactions on Information Theory","volume":"71 3","pages":"2075-2103"},"PeriodicalIF":2.2000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Theory","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10841456/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a series of highly original articles, Efron persuasively illustrated the danger posed to downstream inference by assuming the veracity of a posited null distribution. In a Gaussian model for $n$ z-scores with at most $k < \frac{n}{2}$ nonnulls, Efron suggests estimating the parameters of an empirical null $N(\theta, \sigma^2)$ instead of assuming the theoretical null $N(0, 1)$. Turning to the robust statistics literature and viewing the nonnulls as outliers is unsatisfactory, as the question of optimal rates remains open; even consistency is not known in the regime $k \asymp n$, which is especially relevant to many large-scale inference applications. However, provably rate-optimal robust estimators have been developed in other models (e.g., Huber contamination) that appear quite close to Efron's proposal. Notably, the impossibility of consistency when $k \asymp n$ in these other models may suggest that the same major weakness afflicts Efron's popularly adopted recommendation. A sound evaluation thus requires a complete understanding of the information-theoretic limits. We characterize the regime of $k$ for which consistent estimation is possible, notably without imposing any assumptions at all on the nonnull effects. Unlike in other robust models, consistent estimation of the location parameter is shown to be possible if and only if $\frac{n}{2} - k = \omega(\sqrt{n})$, and of the scale parameter in the entire regime $k < \frac{n}{2}$. Furthermore, we establish sharp minimax rates and show that estimators based on the empirical characteristic function are optimal by exploiting the Gaussian character of the data.
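To make the role of the empirical characteristic function concrete, the following is a minimal Python sketch of a characteristic-function-based estimator of $(\theta, \sigma^2)$ in the model above. It is not the paper's estimator: the simulated data, the frequency grid, and the least-squares fits are all assumptions made for illustration. The point is only that the null component contributes $\pi_0 e^{i\theta t - \sigma^2 t^2/2}$ to the characteristic function, so the slope of $\log|\varphi_n(t)|$ in $t^2/2$ recovers $-\sigma^2$ and the phase recovers $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated instance of the model in the abstract: n z-scores, of which
# n - k are drawn from an unknown empirical null N(theta, sigma^2) and k
# are nonnull. All specific values (n, k, theta, sigma, and the nonnull
# effects) are illustrative choices, not taken from the paper.
n, k = 10_000, 2_000
theta, sigma = 0.5, 1.3
z = rng.normal(theta, sigma, size=n)
signs = rng.choice([-1.0, 1.0], size=k)
z[:k] += signs * rng.uniform(10.0, 50.0, size=k)  # large arbitrary nonnull effects

def empirical_cf(z, t):
    """Empirical characteristic function phi_n(t) = (1/n) * sum_j exp(i t z_j)."""
    return np.exp(1j * np.outer(t, z)).mean(axis=1)

# For moderate t the null component dominates:
#   phi_n(t) ~ pi_0 * exp(i * theta * t - sigma^2 * t^2 / 2),  pi_0 = (n - k) / n,
# so log|phi_n(t)| is roughly affine in t^2 / 2 with slope -sigma^2 (the
# intercept absorbs log pi_0), and the phase arg phi_n(t) grows like theta * t.
# The frequency grid below is an illustrative tuning choice.
t = np.linspace(0.2, 1.0, 20)
phi = empirical_cf(z, t)

slope, _ = np.polyfit(t**2 / 2.0, np.log(np.abs(phi)), 1)
sigma_hat = float(np.sqrt(-slope))
theta_hat = float(np.polyfit(t, np.unwrap(np.angle(phi)), 1)[0])

print(f"theta_hat = {theta_hat:.3f}  (true theta = {theta})")
print(f"sigma_hat = {sigma_hat:.3f}  (true sigma = {sigma})")
```

The sketch also shows where the difficulty lies: at small $t$ the nonnull contribution to $\varphi_n$ is not yet damped and biases the fit, while at large $t$ the null signal $e^{-\sigma^2 t^2/2}$ decays and sampling noise of order $n^{-1/2}$ dominates. Balancing these errors optimally over all configurations of nonnull effects is the kind of trade-off the paper's minimax analysis makes precise.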
About the journal:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.