{"title":"章节编者按:使用Power Insights更好地规划实验","authors":"Laura R. Peck","doi":"10.1177/10982140231154695","DOIUrl":null,"url":null,"abstract":"How many people need to be in my evaluation in order to be able to detect a policyor programrelevant impact? If the program being evaluated is assigned to participants at an aggregate “cluster” or group level—such as classrooms filled with students—how many of those groups do I need? How many participants within each group? What if I am interested in subgroup effects; how many people or groups do I need then? Answers to the questions are essential for smart planning of experimental evaluations and are the motivation for this Experimental Methodology Section. Before I summarize the contributions of this Section’s three articles, let me first define some key concepts and explain what I see to be the main issues for this piece of experimental evaluation work. To begin, statistical “power” refers to an evaluation’s ability to detect an effect that is statistically significant; and minimum detectable effects (MDEs) are the smallest estimated effect that a given design can detect as statistically significant. Ultimately, the effect size is what a given evaluation is designed to estimate, and the evaluator will have to determine (1) what sample design and size is needed to detect that effect, or (2) what MDE is feasible, given budget and sample design and size realities. Several interrelated factors influence a study’s MDE, including (as drawn partly from Peck, 2020, Appendix Box A.1) the choices and realities of statistical significance threshold, statistical power, variance of the impact estimate, the level and variability of the outcome measure, and the clustered nature of the data, as elaborated next. Statistical significance threshold. The statistical significance level is the probability of identifying a false positive result (also referred to as Type I error). The MDE becomes larger as the statistical significance level decreases. All else equal, an impact must be larger to be detected with a statistical significance threshold of 1% than with a statistical significance threshold of 10%. Substantial debate in statistics and related fields focuses on “the p-value” and its value to establishing evidence (e.g., Wasserstein & Lazar, 2016). Statistical power. The statistical power is equal to the probability of correctly rejecting the null hypothesis (or, one minus the probability of a false negative result, or Type II error). In other words, power relates to the analyst’s ability to detect an impact that is statistically significant, should it exist. Statistical power is typically set to 80%, although other values may be reasonable too. Missing the detection of a favorable impact (Type II error) has lower up-front cost implications for the study, relative to falsely claiming that a favorable impact exists (Type I error). That said, an insufficiently powered study might lead to not generating new information (or, worse, to incorrect null findings), an ill-funded investment.","PeriodicalId":51449,"journal":{"name":"American Journal of Evaluation","volume":null,"pages":null},"PeriodicalIF":1.1000,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Section Editor's Note: Using Power Insights to Better Plan Experiments\",\"authors\":\"Laura R. 
Peck\",\"doi\":\"10.1177/10982140231154695\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"How many people need to be in my evaluation in order to be able to detect a policyor programrelevant impact? If the program being evaluated is assigned to participants at an aggregate “cluster” or group level—such as classrooms filled with students—how many of those groups do I need? How many participants within each group? What if I am interested in subgroup effects; how many people or groups do I need then? Answers to the questions are essential for smart planning of experimental evaluations and are the motivation for this Experimental Methodology Section. Before I summarize the contributions of this Section’s three articles, let me first define some key concepts and explain what I see to be the main issues for this piece of experimental evaluation work. To begin, statistical “power” refers to an evaluation’s ability to detect an effect that is statistically significant; and minimum detectable effects (MDEs) are the smallest estimated effect that a given design can detect as statistically significant. Ultimately, the effect size is what a given evaluation is designed to estimate, and the evaluator will have to determine (1) what sample design and size is needed to detect that effect, or (2) what MDE is feasible, given budget and sample design and size realities. Several interrelated factors influence a study’s MDE, including (as drawn partly from Peck, 2020, Appendix Box A.1) the choices and realities of statistical significance threshold, statistical power, variance of the impact estimate, the level and variability of the outcome measure, and the clustered nature of the data, as elaborated next. Statistical significance threshold. The statistical significance level is the probability of identifying a false positive result (also referred to as Type I error). The MDE becomes larger as the statistical significance level decreases. All else equal, an impact must be larger to be detected with a statistical significance threshold of 1% than with a statistical significance threshold of 10%. Substantial debate in statistics and related fields focuses on “the p-value” and its value to establishing evidence (e.g., Wasserstein & Lazar, 2016). Statistical power. The statistical power is equal to the probability of correctly rejecting the null hypothesis (or, one minus the probability of a false negative result, or Type II error). In other words, power relates to the analyst’s ability to detect an impact that is statistically significant, should it exist. Statistical power is typically set to 80%, although other values may be reasonable too. Missing the detection of a favorable impact (Type II error) has lower up-front cost implications for the study, relative to falsely claiming that a favorable impact exists (Type I error). 
That said, an insufficiently powered study might lead to not generating new information (or, worse, to incorrect null findings), an ill-funded investment.\",\"PeriodicalId\":51449,\"journal\":{\"name\":\"American Journal of Evaluation\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.1000,\"publicationDate\":\"2023-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Journal of Evaluation\",\"FirstCategoryId\":\"90\",\"ListUrlMain\":\"https://doi.org/10.1177/10982140231154695\",\"RegionNum\":3,\"RegionCategory\":\"社会学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"SOCIAL SCIENCES, INTERDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Evaluation","FirstCategoryId":"90","ListUrlMain":"https://doi.org/10.1177/10982140231154695","RegionNum":3,"RegionCategory":"社会学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"SOCIAL SCIENCES, INTERDISCIPLINARY","Score":null,"Total":0}
Section Editor's Note: Using Power Insights to Better Plan Experiments
How many people need to be in my evaluation in order to be able to detect a policy- or program-relevant impact? If the program being evaluated is assigned to participants at an aggregate “cluster” or group level—such as classrooms filled with students—how many of those groups do I need? How many participants within each group? What if I am interested in subgroup effects: how many people or groups do I need then? Answers to these questions are essential for smart planning of experimental evaluations and are the motivation for this Experimental Methodology Section. Before I summarize the contributions of this Section’s three articles, let me first define some key concepts and explain what I see to be the main issues for this piece of experimental evaluation work.

To begin, statistical “power” refers to an evaluation’s ability to detect an effect that is statistically significant, and the minimum detectable effect (MDE) is the smallest estimated effect that a given design can detect as statistically significant. Ultimately, the effect size is what a given evaluation is designed to estimate, and the evaluator will have to determine (1) what sample design and size is needed to detect that effect, or (2) what MDE is feasible, given budget and sample design and size realities. Several interrelated factors influence a study’s MDE, including (as drawn partly from Peck, 2020, Appendix Box A.1) the choices and realities of the statistical significance threshold, statistical power, the variance of the impact estimate, the level and variability of the outcome measure, and the clustered nature of the data, as elaborated next.

Statistical significance threshold. The statistical significance level is the probability of identifying a false positive result (also referred to as Type I error). The MDE becomes larger as the statistical significance level decreases: all else equal, an impact must be larger to be detected with a statistical significance threshold of 1% than with a statistical significance threshold of 10%. Substantial debate in statistics and related fields focuses on “the p-value” and its value in establishing evidence (e.g., Wasserstein & Lazar, 2016).

Statistical power. Statistical power is the probability of correctly rejecting the null hypothesis (or, one minus the probability of a false negative result, or Type II error). In other words, power relates to the analyst’s ability to detect an impact that is statistically significant, should it exist. Statistical power is typically set to 80%, although other values may be reasonable too. Failing to detect a favorable impact (Type II error) has lower up-front cost implications for the study than falsely claiming that a favorable impact exists (Type I error). That said, an insufficiently powered study might lead to not generating new information (or, worse, to incorrect null findings), an ill-funded investment.
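To make these moving parts concrete, the short sketch below shows one common way they fit together: the MDE is the product of a multiplier (driven by the significance threshold and the power level) and the standard error of the impact estimate (driven by sample size and, for group-randomized designs, the intracluster correlation). This is an illustrative, normal-approximation calculation for a two-arm design with equal allocation and a standardized outcome; it is not a formula taken from the articles in this Section, and the function and parameter names are my own.

# Illustrative MDE calculation for a two-arm experiment, in standard-deviation
# (effect-size) units. Assumptions (mine, not the articles'): equal allocation,
# outcome variance of 1, a two-sided test, the normal-approximation multiplier
# z_(1 - alpha/2) + z_(power), and the standard design effect 1 + (m - 1) * ICC
# for cluster (group-level) random assignment.
from scipy.stats import norm


def mde(n_per_arm, alpha=0.05, power=0.80, cluster_size=1, icc=0.0):
    """Approximate minimum detectable effect (in SD units)."""
    multiplier = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 at 5% / 80%
    design_effect = 1 + (cluster_size - 1) * icc            # 1.0 if individually randomized
    se = (design_effect * 2 / n_per_arm) ** 0.5             # SE of the impact estimate
    return multiplier * se


# Individual random assignment: 400 people per arm.
print(round(mde(400), 3))                                   # about 0.20 SD
# Same sample randomized as classrooms of 25 students with ICC = 0.10:
print(round(mde(400, cluster_size=25, icc=0.10), 3))        # about 0.37 SD
# A stricter 1% significance threshold raises the MDE for the same sample:
print(round(mde(400, alpha=0.01), 3))                       # about 0.24 SD

Even this simple approximation makes the core tradeoffs visible: a stricter significance threshold or a clustered design both enlarge the MDE, so either the sample must grow or the evaluator must accept that only larger impacts will be detectable.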