Clustering Mixtures with Almost Optimal Separation in Polynomial Time
Authors: Jerry Li, Allen Liu
Journal: SIAM Journal on Computing, Ahead of Print (Journal Article; JCR Q3, Computer Science, Theory & Methods; impact factor 1.2)
Publication date: 2024-02-22
DOI: 10.1137/22m1538788 (https://doi.org/10.1137/22m1538788)
Citations: 0
Abstract
We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions. We are given samples from a mixture of [math] identity covariance Gaussians, such that the minimum distance between any two means is at least [math], for some parameter [math], and the goal is to recover the ground truth clustering of these samples. It is folklore that separation [math] is both necessary and sufficient to recover a good clustering (say, with constant or [math] error), at least information-theoretically. However, the estimators that achieve this guarantee are inefficient. We give the first algorithm that runs in polynomial time in both [math] and the dimension [math], and that almost matches this guarantee. More precisely, we give an algorithm that takes polynomially many samples and time, and that successfully recovers a good clustering so long as the separation is [math], for any [math]. Previously, polynomial time algorithms were known for this problem only when the separation was polynomial in [math], and all algorithms that could tolerate [math] separation required quasipolynomial time. We also extend our result to mixtures of translations of a distribution that satisfies the Poincaré inequality, under additional mild assumptions. Our main technical tool, which we believe is of independent interest, is a novel way to implicitly represent and estimate high degree moments of a distribution, which allows us to extract important information about high degree moments without ever writing down the full moment tensors explicitly.
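The problem setup described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes the true means are known and assigns each sample to its nearest mean, which only shows how the clustering error of an idealized classifier depends on the separation. The function names, the uniform mixing weights, and the placement of the means on scaled standard basis vectors are all assumptions made for illustration.

```python
import math
import random


def sample_mixture(means, n, rng):
    """Draw n points from a uniform mixture of identity-covariance Gaussians."""
    points, labels = [], []
    for _ in range(n):
        j = rng.randrange(len(means))  # pick a component uniformly (assumption)
        points.append([m + rng.gauss(0.0, 1.0) for m in means[j]])
        labels.append(j)
    return points, labels


def nearest_mean(point, means):
    """Return the index of the mean closest to point in Euclidean distance."""
    return min(
        range(len(means)),
        key=lambda j: sum((p - m) ** 2 for p, m in zip(point, means[j])),
    )


def misclassification_rate(k, d, delta, n, seed=0):
    """Fraction of samples mislabeled by an oracle that knows the true means.

    Hypothetical layout: mean j sits at (delta / sqrt(2)) * e_j, so every
    pair of means is exactly delta apart. This requires d >= k.
    """
    if d < k:
        raise ValueError("this illustrative layout needs d >= k")
    rng = random.Random(seed)
    scale = delta / math.sqrt(2)
    means = [[scale if i == j else 0.0 for i in range(d)] for j in range(k)]
    points, labels = sample_mixture(means, n, rng)
    wrong = sum(nearest_mean(x, means) != y for x, y in zip(points, labels))
    return wrong / n


if __name__ == "__main__":
    for delta in (0.5, 2.0, 4.0, 8.0):
        rate = misclassification_rate(k=4, d=4, delta=delta, n=2000)
        print(f"separation {delta}: oracle error rate {rate:.4f}")
```

Running the script for increasing separation shows the oracle's error rate falling toward zero. The difficulty addressed in the paper is achieving a comparable clustering efficiently without access to the true means.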
About the journal:
The SIAM Journal on Computing aims to provide coverage of the most significant work going on in the mathematical and formal aspects of computer science and nonnumerical computing. Submissions must be clearly written and make a significant technical contribution. Topics include but are not limited to analysis and design of algorithms, algorithmic game theory, data structures, computational complexity, computational algebra, computational aspects of combinatorics and graph theory, computational biology, computational geometry, computational robotics, the mathematical aspects of programming languages, artificial intelligence, computational learning, databases, information retrieval, cryptography, networks, distributed computing, parallel algorithms, and computer architecture.