k-Center Clustering with Outliers in the MPC and Streaming Model

M. de Berg, Leyla Biabani, M. Monemizadeh
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), publication date 2023-02-24
DOI: 10.1109/IPDPS54959.2023.00090 (https://doi.org/10.1109/IPDPS54959.2023.00090)
Citations: 3

Abstract

Given a point set $P \subseteq X$ of size $n$ in a metric space $(X, \mathrm{dist})$ of doubling dimension $d$ and two parameters $k \in \mathbb{N}$ and $z \in \mathbb{N}$, the k-center problem with z outliers asks to return a set $\mathcal{C}^* = \{c_1^*, \cdots, c_k^*\} \subseteq X$ of k centers such that the maximum distance of all but z points of P to their nearest center in $\mathcal{C}^*$ is minimized. An (ε, k, z)-coreset for this problem is a weighted point set P* such that an optimal solution for the k-center problem with z outliers on P* gives a (1 ± ε)-approximation for the k-center problem with z outliers on P. We study the construction of such coresets in the Massively Parallel Computing (MPC) model, and in the insertion-only as well as the fully dynamic streaming model. We obtain the following results, for any given 0 < ε ⩽ 1. In all cases, the size of the computed coreset is $O(k/\varepsilon^d + z)$.

• In the MPC model the data are distributed over m machines. One is the coordinator machine, which will contain the final answer; the others are worker machines. We present a deterministic 2-round algorithm using $O(\sqrt{n})$ machines, where the worker machines have $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z + 1))$ local memory, and the coordinator has $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z + 1) + z)$ local memory. The algorithm can handle point sets P that are distributed arbitrarily (possibly adversarially) over the machines. We also present a randomized algorithm that uses only a single round, under the assumption that the input set P is initially distributed randomly over the machines. Then we present a deterministic algorithm that obtains a trade-off between the number of rounds, R, and the storage per machine.

• In the streaming model we have a single machine with limited storage, and P is revealed in a streaming fashion.

○ We present the first lower bound for the insertion-only streaming model, where the points arrive one by one and no points are deleted. We show that any deterministic algorithm that maintains an (ε, k, z)-coreset must use $\Omega(k/\varepsilon^d + z)$ space. We complement this by a deterministic streaming algorithm using $O(k/\varepsilon^d + z)$ space, which is thus optimal.

○ For the fully dynamic data streams, where points can be inserted as well as deleted, we give a randomized algorithm for point sets from a d-dimensional discrete Euclidean space $[\Delta]^d$, where $\Delta \in \mathbb{N}$ indicates the size of the universe from which the coordinates are taken. Our algorithm uses only $O((k/\varepsilon^d + z)\log^4(k\Delta/(\varepsilon\delta)))$ space, and it is the first algorithm for this setting. We also present an $\Omega((k/\varepsilon^d)\log\Delta + z)$ lower bound for deterministic fully dynamic streaming algorithms.

○ For the sliding-window model, we show that any deterministic streaming algorithm that guarantees a (1 + ε)-approximation for the k-center problem with outliers in $\mathbb{R}^d$ must use $\Omega((kz/\varepsilon^d)\log\sigma)$ space, where σ is the ratio of the largest and smallest distance between any two points in the stream. This (negatively) answers a question posed by De Berg, Monemizadeh, and Zhong [1].
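To make the objective in the abstract concrete, the following minimal Python sketch evaluates the cost of a given set of centers on a weighted point set: the smallest radius r such that the points farther than r from their nearest center have total weight at most z. It is an illustration of the problem definition only, not the coreset construction from the paper. The function name `outlier_radius`, the representation of points as coordinate tuples, and the use of Euclidean distance (the paper works in general metric spaces of bounded doubling dimension) are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def outlier_radius(
    points: Sequence[Point],
    weights: Sequence[int],
    centers: Sequence[Point],
    z: int,
) -> float:
    """Cost of `centers` for k-center with z outliers on a weighted point set.

    Returns the smallest radius r such that the points lying farther than r
    from their nearest center have total weight at most z.
    Assumption: Euclidean distance (math.dist) stands in for the metric.
    """
    # Pair each point's distance to its nearest center with its weight,
    # sorted from the farthest point to the nearest one.
    by_distance = sorted(
        ((min(math.dist(p, c) for c in centers), w)
         for p, w in zip(points, weights)),
        reverse=True,
    )
    budget = z  # outlier weight that may still be discarded
    for dist, weight in by_distance:
        if weight > budget:
            # This point cannot be declared an outlier, so it must be covered.
            return dist
        budget -= weight
    return 0.0  # every point fits in the outlier budget


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,), (100.0,)]
    ctrs = [(0.5,), (10.5,)]
    # With one allowed outlier the far point at 100 is ignored.
    print(outlier_radius(pts, [1] * 5, ctrs, z=1))
    # With no outliers allowed the far point dominates the cost.
    print(outlier_radius(pts, [1] * 5, ctrs, z=0))
```

With unit weights this is the cost on the original point set P. With general integer weights it is the cost on a weighted set such as a coreset P*, where a point of weight w stands for w original points and is either covered or discarded as a whole.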