A New Approach for Testing Properties of Discrete Distributions

2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS) | Pub Date: 2016-01-21 | DOI: 10.1109/FOCS.2016.78

Ilias Diakonikolas, D. Kane


Citations: 149

Abstract

We study problems in distribution property testing: Given sample access to one or more unknown discrete distributions, we want to determine whether they have some global property or are epsilon-far from having the property in L1 distance (equivalently, total variation distance, or "statistical distance"). In this work, we give a novel general approach for distribution testing. We describe two techniques: our first technique gives sample-optimal testers, while our second technique gives matching sample lower bounds. As a consequence, we resolve the sample complexity of a wide variety of testing problems. Our upper bounds are obtained via a modular reduction-based approach. Our approach yields optimal testers for numerous problems by using a standard L2-identity tester as a black box. Using this recipe, we obtain simple estimators for a wide range of problems, encompassing many problems previously studied in the TCS literature, namely: (1) identity testing to a fixed distribution, (2) closeness testing between two unknown distributions (with equal/unequal sample sizes), (3) independence testing (in any number of dimensions), (4) closeness testing for collections of distributions, and (5) testing histograms. For all of these problems, our testers are sample-optimal, up to constant factors. With the exception of (1), ours are the first sample-optimal testers for the corresponding problems. Moreover, our estimators are significantly simpler to state and analyze compared to previous results. As an important application of our reduction-based technique, we obtain the first adaptive algorithm for testing equivalence between two unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions, as opposed to merely their domain size, and is significantly better compared to the worst-case optimal L1-tester in many natural instances. Moreover, our technique naturally generalizes to other metrics beyond the L1-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence tester under the Hellinger distance. Our lower bounds are obtained via a direct information-theoretic approach: Given a candidate hard instance, our proof proceeds by bounding the mutual information between appropriate random variables. While this is a classical method in information theory, prior to our work, it had not been used in this context. Previous lower bounds relied either on the birthday paradox or on moment-matching, and were thus restricted to symmetric properties. Our lower bound approach does not suffer from any such restrictions and gives tight sample lower bounds for the aforementioned problems.
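The abstract refers to several distance measures between discrete distributions (L1, total variation, L2, and Hellinger). As a reading aid, the following is a minimal sketch of their standard textbook definitions on explicitly known probability vectors. It is not the paper's tester, which works from samples. The function names, the example vectors, and the `1/sqrt(2)` normalization of the Hellinger distance are assumptions chosen for illustration.

```python
import math

def l1_distance(p, q):
    """Sum of absolute differences between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_variation(p, q):
    """Total variation distance; equal to half the L1 distance."""
    return 0.5 * l1_distance(p, q)

def l2_distance(p, q):
    """Euclidean distance between two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hellinger_distance(p, q):
    """Hellinger distance, assuming the 1/sqrt(2) normalization (range [0, 1])."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * s)

def is_eps_far_from(p, q, eps):
    """Hypothetical helper: is p at least eps away from one fixed q in L1?

    Being eps-far from a *property* means being eps-far from every
    distribution that has the property; this checks a single reference only.
    """
    return l1_distance(p, q) >= eps

# Illustrative distributions over a domain of size 4 (made-up values).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.4, 0.3, 0.2, 0.1]

print("L1:", l1_distance(skewed, uniform))
print("TV:", total_variation(skewed, uniform))
print("L2:", l2_distance(skewed, uniform))
print("Hellinger:", hellinger_distance(skewed, uniform))
```

For these two vectors the L1 distance is approximately 0.4 and the total variation distance approximately 0.2, which shows the factor of two behind the abstract's "equivalently". With this normalization, the Hellinger distance H is related to total variation by H² ≤ TV ≤ √2·H.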


