k-Clustering with Fair Outliers

Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. Pub Date: 2022-02-11. DOI: 10.1145/3488560.3498485

Matteo Almanza, Alessandro Epasto, A. Panconesi, Giuseppe Re


Citations: 3

Abstract

Clustering problems and clustering algorithms are often overly sensitive to the presence of outliers: even a handful of points can greatly affect the structure of the optimal solution and its cost. This is why many algorithms for robust clustering problems have been formulated in recent years. These algorithms discard some points as outliers, excluding them from the clustering. However, outlier selection can be unfair: some categories of input points may be disproportionately affected by the outlier removal algorithm. We study the problem of k-clustering with fair outlier removal and provide the first approximation algorithm for well-known clustering formulations, such as k-means and k-median. We analyze this algorithm and prove that it has strong theoretical guarantees. We complement this result with an empirical evaluation showing that, while standard methods for outlier removal have a disproportionate impact across categories of input points, our algorithm equalizes the impact while retaining strong experimental performance on multiple real-world datasets. We also show how the fairness of outlier removal can influence the performance of a downstream learning task. Finally, we provide a coreset construction, which makes our algorithm scalable to very large datasets.
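The disparity the abstract describes can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes fixed centers, a squared Euclidean (k-means style) cost, and a hypothetical per-category quota as one simple notion of balance. The function names, the data, and the quota rule are all assumptions made for illustration.

```python
from collections import Counter

def dist_to_nearest(p, centers):
    # Squared Euclidean distance from p to its closest center (k-means style cost).
    return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)

def global_outliers(points, centers, z):
    # Category-blind baseline: indices of the z points farthest from the centers.
    order = sorted(range(len(points)),
                   key=lambda i: dist_to_nearest(points[i], centers),
                   reverse=True)
    return set(order[:z])

def per_group_outliers(points, groups, centers, quota):
    # Hypothetical balanced rule (an assumption, not the paper's method):
    # within each category g, drop only its quota[g] farthest points.
    chosen = set()
    for g, q in quota.items():
        members = [i for i in range(len(points)) if groups[i] == g]
        members.sort(key=lambda i: dist_to_nearest(points[i], centers),
                     reverse=True)
        chosen.update(members[:q])
    return chosen

# Made-up data: category "b" is small and lies far from both centers.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0),
          (5.1, 5.0), (9.0, 9.0), (8.0, 9.5), (0.0, 3.0)]
groups = ["a", "a", "a", "a", "a", "b", "b", "a"]
centers = [(0.0, 0.0), (5.0, 5.0)]

blind = global_outliers(points, centers, 2)
balanced = per_group_outliers(points, groups, centers, {"a": 1, "b": 1})

# Count how many removed points fall in each category under each rule.
blind_counts = Counter(groups[i] for i in blind)
balanced_counts = Counter(groups[i] for i in balanced)
print(dict(blind_counts), dict(balanced_counts))
```

On this data the category-blind rule removes both points of category "b" and none of category "a", while the quota rule removes one point from each. A real method would also have to choose the centers jointly with the outliers, which this sketch deliberately leaves out.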



