Instance optimal learning of discrete distributions

G. Valiant, Paul Valiant
{"title":"Instance optimal learning of discrete distributions","authors":"G. Valiant, Paul Valiant","doi":"10.1145/2897518.2897641","DOIUrl":null,"url":null,"abstract":"We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (equivalently, total variation distance, or \"statistical distance\"). Perhaps surprisingly, it is often possible to \"de-noise\" the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given n independent draws from a distribution p, our algorithm returns a labelled vector whose expected distance from p is equal to the minimum possible expected error that could be obtained by any algorithm, even one that is given the true unlabeled vector of probabilities of distribution p and simply needs to assign labels---up to an additive subconstant term that is independent of p and goes to zero as n gets large. This somewhat surprising result has several conceptual implications, including the fact that, for any large sample from a distribution over discrete support, prior knowledge of the rates of decay of the tails of the distribution (e.g. power-law type assumptions) is not significantly helpful for the task of learning the distribution. As a consequence of our techniques, we also show that given a set of n samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to n log n. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.","PeriodicalId":442965,"journal":{"name":"Proceedings of the forty-eighth annual ACM symposium on Theory of Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"55","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the forty-eighth annual ACM symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2897518.2897641","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 55

Abstract

We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (equivalently, total variation distance, or "statistical distance"). Perhaps surprisingly, it is often possible to "de-noise" the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an instance optimal learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given n independent draws from a distribution p, our algorithm returns a labelled vector whose expected distance from p is equal to the minimum possible expected error that could be obtained by any algorithm, even one that is given the true unlabeled vector of probabilities of distribution p and simply needs to assign labels---up to an additive subconstant term that is independent of p and goes to zero as n gets large. This somewhat surprising result has several conceptual implications, including the fact that, for any large sample from a distribution over discrete support, prior knowledge of the rates of decay of the tails of the distribution (e.g. power-law type assumptions) is not significantly helpful for the task of learning the distribution. As a consequence of our techniques, we also show that given a set of n samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to n log n. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.
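Below is a minimal, self-contained sketch in Python (not the paper's algorithm) that illustrates the learning task on synthetic data: we draw n samples from a known power-law-style distribution, form the empirical distribution, and measure its L1 error against the truth. As a crude stand-in for "de-noising", it also reports the Good-Turing missing-mass estimate (fraction of symbols seen exactly once), which hints at how much probability mass the empirical distribution misses entirely; the instance optimal estimator of the paper is substantially more sophisticated. All distribution parameters and sample sizes here are illustrative choices, not values from the paper.

```python
# Illustrative sketch only: compares the empirical distribution to a known
# ground truth in L1 distance, and computes the Good-Turing missing-mass
# estimate as a simple indication of how much the empirical distribution
# can be improved. This is NOT the instance optimal algorithm of the paper.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Ground-truth distribution: a truncated power law over k symbols (assumed).
k = 10_000
p = 1.0 / np.arange(1, k + 1) ** 1.5
p /= p.sum()

n = 5_000
sample = rng.choice(k, size=n, p=p)
counts = Counter(sample.tolist())

# Empirical estimate: observed frequency / n for each symbol (0 if unseen).
emp = np.zeros(k)
for sym, c in counts.items():
    emp[sym] = c / n

# L1 distance between the empirical distribution and the truth
# (equal to twice the total variation distance).
l1_empirical = np.abs(emp - p).sum()

# Good-Turing missing mass: number of symbols seen exactly once, divided by n.
f1 = sum(1 for c in counts.values() if c == 1)
missing_mass = f1 / n

print(f"L1(empirical, truth)     = {l1_empirical:.4f}")
print(f"Good-Turing missing mass = {missing_mass:.4f}")
print(f"True unseen mass         = {p[emp == 0].sum():.4f}")
```

Running the sketch shows that the empirical distribution leaves a noticeable amount of true probability mass on unseen symbols, which is precisely the regime where de-noising beyond the empirical distribution can help.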