{"title":"Statistical Significance Testing in Theory and in Practice","authors":"Ben Carterette","doi":"10.1145/3341981.3358959","DOIUrl":null,"url":null,"abstract":"The past 25 years have seen a great improvement in the rigor of experimentation on information access problems. This is due primarily to three factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtreval Conference~\\citetrecbook ), the increased ease of online A/B testing on large user populations, and the increased practice of statistical hypothesis testing to determine whether observed improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work on information access (IA) problems such as search and recommendation increasingly cannot be published unless it has been evaluated offline using a well-constructed test collection or online on a large user base and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a \"black box'': evaluation results go in and a p-value comes out. But because significance is such an important factor in determining what directions to explore and what is published or deployed, using p-values obtained without thought can have consequences for everyone working in IA. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false; could that be the case for IA as well?","PeriodicalId":173154,"journal":{"name":"Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3341981.3358959","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
The past 25 years have seen a great improvement in the rigor of experimentation on information access problems. This is due primarily to three factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtrieval Conference), the increased ease of online A/B testing on large user populations, and the increased practice of statistical hypothesis testing to determine whether observed improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work on information access (IA) problems such as search and recommendation increasingly cannot be published unless it has been evaluated offline using a well-constructed test collection or online on a large user base and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a "black box": evaluation results go in and a p-value comes out. But because significance is such an important factor in determining what directions to explore and what is published or deployed, using p-values obtained without thought can have consequences for everyone working in IA. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false; could that be the case for IA as well?
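To make the "evaluation results go in, a p-value comes out" workflow concrete, here is a minimal sketch of the kind of offline test the abstract alludes to: a paired significance test over per-topic effectiveness scores for a baseline and an experimental system. It is not code from the tutorial; the scores are synthetic stand-ins for real test-collection output (e.g., per-topic AP or nDCG), the effect size and topic count are arbitrary, and SciPy/NumPy are assumed to be available.

```python
# Sketch only: compare two hypothetical systems on the same 50 topics with
# (a) a paired t-test and (b) a paired randomization (permutation) test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_topics = 50

# Hypothetical per-topic evaluation scores for two systems on shared topics.
baseline = rng.beta(2, 5, size=n_topics)
experimental = np.clip(baseline + rng.normal(0.02, 0.05, size=n_topics), 0.0, 1.0)

# (a) Paired t-test: the most common "black box" choice in IR papers.
t_stat, p_ttest = stats.ttest_rel(experimental, baseline)

# (b) Paired randomization test on the mean difference, which relies on
# weaker distributional assumptions than the t-test.
diffs = experimental - baseline
observed = diffs.mean()
n_perm = 10000
signs = rng.choice([-1, 1], size=(n_perm, n_topics))   # random sign flips
perm_means = (signs * diffs).mean(axis=1)
p_perm = np.mean(np.abs(perm_means) >= abs(observed))  # two-sided p-value

print(f"mean difference: {observed:.4f}")
print(f"paired t-test p-value:      {p_ttest:.4f}")
print(f"randomization test p-value: {p_perm:.4f}")
```

The point of showing two tests side by side is the abstract's warning: different tests encode different assumptions about the score distribution, so the p-value is not a single objective number but the output of a modeling choice that the experimenter should understand before acting on it.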