k-Clustering with Fair Outliers

Matteo Almanza, Alessandro Epasto, A. Panconesi, Giuseppe Re
Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. Published 2022-02-11.
DOI: https://doi.org/10.1145/3488560.3498485
Citations: 3

Abstract

Clustering problems and clustering algorithms are often overly sensitive to the presence of outliers: even a handful of points can greatly affect the structure of the optimal solution and its cost. This is why many algorithms for robust clustering problems have been formulated in recent years. These algorithms discard some points as outliers, excluding them from the clustering. However, outlier selection can be unfair: some categories of input points may be disproportionately affected by the outlier removal algorithm. We study the problem of k-clustering with fair outlier removal and provide the first approximation algorithm for well-known clustering formulations, such as k-means and k-median. We analyze this algorithm and prove that it has strong theoretical guarantees. We complement this result with an empirical evaluation showing that, while standard methods for outlier removal have a disproportionate impact across categories of input points, our algorithm equalizes the impact while retaining strong experimental performance on multiple real-world datasets. We also show how the fairness of outlier removal can influence the performance of a downstream learning task. Finally, we provide a coreset construction, which makes our algorithm scalable to very large datasets.
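
To make the notion of disproportionate impact concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a simple baseline that discards the z points farthest from their nearest center, then reports what fraction of each category was discarded. The function names, the toy data, and the use of a per-category removal rate as the measure of impact are all illustrative assumptions; the abstract does not specify the fairness definition used in the paper.

```python
from collections import Counter
from math import dist


def farthest_outliers(points, centers, z):
    """Return the indices of the z points farthest from their nearest center.

    This is a naive baseline for outlier removal, used only for illustration.
    """
    nearest = [min(dist(p, c) for c in centers) for p in points]
    order = sorted(range(len(points)), key=lambda i: nearest[i], reverse=True)
    return set(order[:z])


def removal_rate_by_group(groups, outliers):
    """Fraction of each category's points that were discarded as outliers."""
    totals = Counter(groups)
    removed = Counter(groups[i] for i in outliers)
    return {g: removed[g] / totals[g] for g in totals}


# Hypothetical toy data: two categories, "a" and "b", with four points each.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0),   # category "a"
    (1.1, 1.0), (5.0, 5.0), (6.0, 6.0), (1.0, 1.2),   # category "b"
]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
centers = [(0.0, 0.0), (1.0, 1.0)]  # assumed fixed centers, k = 2

outliers = farthest_outliers(points, centers, z=2)
rates = removal_rate_by_group(groups, outliers)
print(sorted(outliers))  # indices of the discarded points
print(rates)             # per-category removal rate
```

On this toy input both discarded points belong to category "b", so half of that category is removed while category "a" is untouched. A fair outlier removal method would aim to avoid this kind of imbalance.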