Learning from untrusted data

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
Pub Date: 2016-11-07
DOI: 10.1145/3055399.3055491

M. Charikar, J. Steinhardt, G. Valiant


Citations: 258

Abstract

The vast majority of theoretical results in machine learning and statistics assume that the training data is a reliable reflection of the phenomena to be learned. Similarly, most learning techniques used in practice are brittle to the presence of large amounts of biased or malicious data. Motivated by this, we consider two frameworks for studying estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers such that at least one is accurate. For example, given a dataset of n points for which an unknown subset of αn points are drawn from a distribution of interest, and no assumptions are made about the remaining (1 - α)n points, is it possible to return a list of poly(1/α) answers, at least one of which is accurate? The second framework, which we term the semi-verified model, asks whether a small dataset of trusted data (drawn from the distribution in question) can be used to extract accurate information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This result has immediate implications for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.
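To make the two frameworks concrete, here is a minimal one-dimensional sketch in Python. It is not the algorithm from the paper: it uses a naive random-sampling baseline as the list decoder and a nearest-to-trusted-median rule for the semi-verified step. The function names, the decoy-cluster adversary, and all parameter values are assumptions chosen for illustration only.

```python
import math
import random
import statistics


def make_untrusted_data(n, alpha, true_mean, rng):
    """Return n points: alpha*n inliers from N(true_mean, 1), the rest decoys."""
    n_good = int(alpha * n)
    good = [rng.gauss(true_mean, 1.0) for _ in range(n_good)]
    # Hypothetical adversary: three decoy clusters that each look as
    # plausible as the real one, so no single answer can be singled out.
    decoys = [true_mean + 50.0 * (j + 1) for j in range(3)]
    bad = [rng.gauss(decoys[i % 3], 1.0) for i in range(n - n_good)]
    data = good + bad
    rng.shuffle(data)
    return data


def candidate_list(data, alpha, rng, failure_prob=1e-3):
    """Naive list decoder: sample enough points that one is likely an inlier."""
    # Each draw is an inlier with probability alpha, so k draws all miss
    # with probability (1 - alpha)^k <= exp(-alpha * k) <= failure_prob.
    k = math.ceil(math.log(1.0 / failure_prob) / alpha)
    return [rng.choice(data) for _ in range(k)]


def pick_with_trusted(candidates, trusted):
    """Semi-verified step: keep the candidate closest to the trusted median."""
    anchor = statistics.median(trusted)
    return min(candidates, key=lambda c: abs(c - anchor))


rng = random.Random(0)
true_mean = 3.0
data = make_untrusted_data(n=2000, alpha=0.25, true_mean=true_mean, rng=rng)
candidates = candidate_list(data, alpha=0.25, rng=rng)
trusted = [rng.gauss(true_mean, 1.0) for _ in range(5)]
estimate = pick_with_trusted(candidates, trusted)
print(len(candidates), round(abs(estimate - true_mean), 2))
```

With α = 0.25 the untrusted data contains four equally sized clusters, so the data alone cannot reveal which one is genuine; a list of candidates, or a handful of trusted points, is what resolves the ambiguity. This baseline is only reasonable in low dimension: in d dimensions a single raw sample typically lies on the order of √d standard deviations from the mean, which is one reason more refined algorithms are needed.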


