Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments

Proceedings of the 13th International Conference on Web Search and Data Mining. Pub Date: 2020-01-20. DOI: 10.1145/3336191.3371871

Somit Gupta, Xiaolin Shi, Pavel A. Dmitriev, Xin Fu, Avijit Mukherjee


Citations: 1

Abstract

A/B testing is the gold standard for estimating the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in industry to test changes ranging from simple copy or UI changes to more complex ones, such as using machine learning models to personalize the user experience. The key aspect of A/B testing is the evaluation of experiment results. Designing the right set of metrics (correct outcome measures, data quality indicators, guardrails that prevent harm to the business, and a comprehensive set of supporting metrics to understand the "why" behind the key movements) is the #1 challenge practitioners face when trying to scale their experimentation program [11, 14]. On the technical side, improving the sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium-sized businesses try to adopt A/B testing and suffer from insufficient statistical power. In this tutorial we will discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on lessons learned and practical guidelines as well as open research questions. A version of this tutorial was also presented at KDD 2019 [23]. It was attended by around 150 participants.


