Training Gaussian Mixture Models at Scale via Coresets

J. Mach. Learn. Res. | Pub Date: 2017-03-23 | DOI: 10.3929/ETHZ-B-000269194

Mario Lucic, Matthew Faulkner, Andreas Krause, Dan Feldman

Citations: 82

Abstract

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in the dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed in both distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and on new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables a significant reduction in training time with negligible approximation error.
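The idea of a weighted subset standing in for the full data can be illustrated with a small sketch. The abstract does not spell out the paper's construction, so the sampling rule below is an assumption: a simple importance-sampling scheme that mixes uniform sampling with sampling proportional to squared distance from the data mean. The function names `build_coreset` and `gmm_neg_log_likelihood`, the synthetic two-cluster data, and the coreset size of 500 are all hypothetical choices made for illustration. The sketch only checks that, for one fixed mixture model, the weighted negative log-likelihood on the sample is close to the value on the full data.

```python
import numpy as np


def build_coreset(X, m, rng):
    """Draw m weighted points from X.

    Illustrative importance sampling only; this is NOT claimed to be
    the construction used in the paper.
    """
    n = X.shape[0]
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    # Half the probability mass is uniform, half is proportional to the
    # squared distance from the mean, so far-away points are not missed.
    q = 0.5 / n + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Inverse-probability weights make weighted sums unbiased estimates
    # of the corresponding sums over the full data set.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights


def gmm_neg_log_likelihood(X, w, pis, means, covs):
    """Weighted negative log-likelihood of X under a Gaussian mixture."""
    n, d = X.shape
    log_probs = np.empty((n, len(pis)))
    for k, (pi_k, mu, cov) in enumerate(zip(pis, means, covs)):
        diff = X - mu
        maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
        _, logdet = np.linalg.slogdet(cov)
        log_probs[:, k] = np.log(pi_k) - 0.5 * (
            d * np.log(2 * np.pi) + logdet + maha
        )
    # Log-sum-exp over components for numerical stability.
    top = log_probs.max(axis=1, keepdims=True)
    log_mix = top[:, 0] + np.log(np.exp(log_probs - top).sum(axis=1))
    return -np.sum(w * log_mix)


# Synthetic data: two unit-covariance clusters in 2D (an assumption).
rng = np.random.default_rng(0)
n = 20000
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
labels = rng.integers(0, 2, size=n)
X = centers[labels] + rng.normal(size=(n, 2))

# A fixed candidate model to evaluate on both data sets.
pis = np.array([0.5, 0.5])
covs = np.array([np.eye(2), np.eye(2)])

C, w = build_coreset(X, m=500, rng=rng)
full = gmm_neg_log_likelihood(X, np.ones(n), pis, centers, covs)
approx = gmm_neg_log_likelihood(C, w, pis, centers, covs)
rel_err = abs(full - approx) / full
print(f"full: {full:.1f}  weighted sample: {approx:.1f}  rel. error: {rel_err:.4f}")
```

A coreset in the sense of the abstract must give this kind of approximation for every candidate model at once, which is a stronger property than the single-model check shown here. In practice the weighted sample would be passed to a mixture-fitting routine that supports per-point weights.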


