Online Multiple Hypothesis Testing
David S. Robertson, James M. S. Wason, Aaditya Ramdas
Statistical Science, 2023. DOI: 10.1214/23-sts901
Modern data analysis frequently involves large-scale hypothesis testing, which naturally gives rise to the problem of maintaining control of a suitable type I error rate, such as the false discovery rate (FDR). In many biomedical and technological applications, an additional complexity is that hypotheses are tested in an online manner, one-by-one over time. However, traditional procedures that control the FDR, such as the Benjamini-Hochberg procedure, assume that all p-values are available to be tested at a single time point. To address these challenges, a new field of methodology has developed over the past 15 years showing how to control error rates for online multiple hypothesis testing. In this framework, hypotheses arrive in a stream, and at each time point the analyst decides whether to reject the current hypothesis based both on the evidence against it and on the previous rejection decisions. In this paper, we present a comprehensive exposition of the literature on online error rate control, with a review of key theory as well as a focus on applied examples. We also provide simulation results comparing different online testing algorithms and an up-to-date overview of the many methodological extensions that have been proposed.
Distributionally Robust and Generalizable Inference
Dominik Rothenhäusler, Peter Bühlmann
Statistical Science, 2023. DOI: 10.1214/23-sts902
We discuss recently developed methods that quantify the stability and generalizability of statistical findings under distributional changes. In many practical problems, the data is not drawn i.i.d. from the target population. For example, unobserved sampling bias, batch effects, or unknown associations might inflate the variance compared to i.i.d. sampling. For reliable statistical inference, it is thus necessary to account for these types of variation. We discuss and review two methods that allow one to quantify distributional stability based on a single dataset. The first method computes the sensitivity of a parameter under worst-case distributional perturbations to understand which types of shift pose a threat to external validity. The second method treats distributional shifts as random, which allows one to assess average (rather than worst-case) robustness. Based on a stability analysis of multiple estimators on a single dataset, it integrates both sampling and distributional uncertainty into a single confidence interval.
Replicability Across Multiple Studies
Marina Bogomolov, Ruth Heller
Statistical Science, 2023. DOI: 10.1214/23-sts892
Meta-analysis is routinely performed in many scientific disciplines. This analysis is attractive since discoveries are possible even when all the individual studies are underpowered. However, the meta-analytic discoveries may be entirely driven by signal in a single study, and thus nonreplicable. Although the great majority of meta-analyses carried out to date do not draw inferences about the replicability of their findings, it is possible to do so. We provide a selective overview of analyses that can be carried out to establish the replicability of scientific findings. We describe methods for the setting where a single outcome is examined in multiple studies (as is common in systematic reviews of medical interventions), as well as for the setting where multiple studies each examine multiple features (as in genomics applications). We also discuss some of the current shortcomings and future directions.
Defining Replicability of Prediction Rules
Giovanni Parmigiani
Statistical Science, 2023. DOI: 10.1214/23-sts891
In this article, I propose an approach for defining replicability for prediction rules. Motivated by a recent report by the U.S. National Academy of Sciences, I start from the perspective that replicability is obtaining consistent results across studies suitable to address the same prediction question, each of which has obtained its own data. I then discuss concepts and issues in defining the key elements of this statement. I focus specifically on the meaning of “consistent results” in typical utilization contexts, and propose a multi-agent framework for defining replicability, in which agents are neither allied nor adversaries. I recover some of the prevalent practical approaches as special cases. I hope to provide guidance for a more systematic assessment of replicability in machine learning.
Tracking Truth Through Measurement and the Spyglass of Statistics
Antonio Possolo
Statistical Science, 2023. DOI: 10.1214/23-sts899
The measurement of a quantity is reproducible when mutually independent, multiple measurements made of it yield mutually consistent measurement results, that is, when the measured values, after due allowance for their associated uncertainties, do not differ significantly from one another. Interlaboratory comparisons organized deliberately for the purpose, and meta-analyses that are structured so as to be fit for the same purpose, are procedures of choice to ascertain measurement reproducibility. The realistic evaluation of measurement uncertainty is a key preliminary to the assessment of reproducibility because lack of reproducibility manifests itself as dispersion or variability of measured values in excess of what their associated uncertainties suggest that they should exhibit. For this reason, we review the distinctive traits of measurement in the physical sciences and technologies, including medicine, and discuss the meaning and expression of measurement uncertainty. This contribution illustrates the application of statistical models and methods to quantify measurement uncertainty and to assess reproducibility in four concrete, real-life examples, in the process revealing that lack of reproducibility can be a consequence of one or more of the following: intrinsic differences between the laboratories making the measurements; choices of statistical model and of procedure for data reduction; or causes yet to be identified. Despite the instances of lack of reproducibility that we review, and many others like them, the outlook is optimistic. First, because “lack of reproducibility is not necessarily bad news; it may herald new discoveries and signal scientific progress” (Nat. Phys. 16 (2020) 117–119). Second, and as the example about the measurement of the Newtonian constant of gravitation, G, illustrates, when faced with a reproducibility crisis the scientific community often engages in cooperative efforts to understand the root causes of the lack of reproducibility, leading to advances in scientific knowledge.
{"title":"Note on Legendre’s Method of Least Squares","authors":"J. Nyblom","doi":"10.1214/23-sts887","DOIUrl":"https://doi.org/10.1214/23-sts887","url":null,"abstract":"","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":"1 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42188583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rejoinder: Response-Adaptive Randomization in Clinical Trials
D. Robertson, K. M. Lee, Boryana C. López-Kolkovska, S. Villar
Statistical Science, 2023. DOI: 10.1214/23-sts865rej
{"title":"Comment: Response Adaptive Randomization in Practice","authors":"Scott M. Berry, K. Viele","doi":"10.1214/23-sts865f","DOIUrl":"https://doi.org/10.1214/23-sts865f","url":null,"abstract":"","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47112521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comment: Is Response-Adaptive Randomization a “Good Thing” or Not in Clinical Trials? Why We Cannot Take Sides","authors":"A. Giovagnoli","doi":"10.1214/23-sts865e","DOIUrl":"https://doi.org/10.1214/23-sts865e","url":null,"abstract":"","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45294939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comment: Response-Adaptive Randomization in Clinical Trials: From Myths to Practical Considerations","authors":"Yunshan Duan, P. Müller, Yuan Ji","doi":"10.1214/23-sts865b","DOIUrl":"https://doi.org/10.1214/23-sts865b","url":null,"abstract":"","PeriodicalId":51172,"journal":{"name":"Statistical Science","volume":"1 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66089271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}