Algorithmic stability for adaptive data analysis

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. Pub Date: 2015-11-08. DOI: 10.1145/2897518.2897566

Raef Bassily, Kobbi Nissim, Adam D. Smith, T. Steinke, Uri Stemmer, Jonathan Ullman


Citations: 244

Abstract

Adaptivity is an important feature of data analysis: the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014) initiated a general formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis.

Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, accurately answers a sequence of adaptively chosen "queries" about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy?

In this work we make two new contributions towards resolving this question:

1. We give upper bounds on the number of samples n that are needed to answer statistical queries. The bounds improve and simplify the work of Dwork et al. (STOC, 2015), and have been applied in subsequent work by those authors (Science, 2015; NIPS, 2015).

2. We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and an important class of optimization queries (alternatively, risk minimization queries).

As in Dwork et al., our algorithms are based on a connection with algorithmic stability in the form of differential privacy. We extend their work by giving a quantitatively optimal, more general, and simpler proof of their main theorem that the stability notion guaranteed by differential privacy implies low generalization error. We also show that weaker stability guarantees such as bounded KL divergence and total variation distance lead to correspondingly weaker generalization guarantees.
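To make the setting concrete, the following is a minimal Python sketch of one way a mechanism might answer adaptively chosen statistical queries in a differentially private manner. It is an illustration only, not the algorithm analyzed in the paper: it uses the standard Laplace mechanism with a fixed privacy parameter per query, and the names `NoisyQueryAnswerer`, `epsilon_per_query`, and the example queries are hypothetical. A statistical query is assumed here to be a function mapping a single sample to a value in [0, 1], with the target answer being its expectation under P.

```python
import random
from typing import Callable, List, Sequence

# Assumption: a statistical query maps one sample to a value in [0, 1].
Query = Callable[[float], float]


class NoisyQueryAnswerer:
    """Answers statistical queries with Laplace noise (illustrative sketch)."""

    def __init__(self, samples: Sequence[float], epsilon_per_query: float, seed: int = 0):
        if not samples:
            raise ValueError("need at least one sample")
        if epsilon_per_query <= 0:
            raise ValueError("epsilon_per_query must be positive")
        self.samples: List[float] = list(samples)
        self.epsilon = epsilon_per_query
        self.rng = random.Random(seed)

    def _laplace(self, scale: float) -> float:
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with location 0 and that scale.
        rate = 1.0 / scale
        return self.rng.expovariate(rate) - self.rng.expovariate(rate)

    def answer(self, query: Query) -> float:
        n = len(self.samples)
        empirical_mean = sum(query(x) for x in self.samples) / n
        # Replacing one sample changes the empirical mean by at most 1/n,
        # so Laplace noise of scale 1/(n * epsilon) makes this single
        # answer epsilon-differentially private.
        noisy = empirical_mean + self._laplace(1.0 / (n * self.epsilon))
        # Clamping is post-processing and does not affect privacy.
        return min(1.0, max(0.0, noisy))


# Usage: the analyst's second query depends on the first answer (adaptivity).
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(10_000)]  # samples from P = Uniform[0, 1]
mechanism = NoisyQueryAnswerer(x, epsilon_per_query=0.5)

first = mechanism.answer(lambda s: 1.0 if s > 0.5 else 0.0)
threshold = 0.25 if first > 0.5 else 0.75       # chosen adaptively
second = mechanism.answer(lambda s: 1.0 if s > threshold else 0.0)
print(first, second)
```

Under basic composition, answering k queries this way is k times `epsilon_per_query` differentially private overall, so in this naive sketch the noise budget has to be chosen with the total number of queries in mind. The sample-complexity bounds described in the abstract come from a sharper analysis than this example suggests.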
