k-Center Clustering with Outliers in the MPC and Streaming Model
M. D. Berg, Leyla Biabani, M. Monemizadeh
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), published 2023-02-24
DOI: https://doi.org/10.1109/IPDPS54959.2023.00090
Citations: 3
Abstract
Given a point set P ⊆ X of size n in a metric space (X, dist) of doubling dimension d and two parameters k ∈ ℕ and z ∈ ℕ, the k-center problem with z outliers asks to return a set $\mathcal{C}^* = \{c_1^*, \ldots, c_k^*\} \subseteq X$ of k centers such that the maximum distance of all but z points of P to their nearest center in $\mathcal{C}^*$ is minimized. An (ε, k, z)-coreset for this problem is a weighted point set P* such that an optimal solution for the k-center problem with z outliers on P* gives a (1 ± ε)-approximation for the k-center problem with z outliers on P. We study the construction of such coresets in the Massively Parallel Computing (MPC) model, and in the insertion-only as well as the fully dynamic streaming model. We obtain the following results, for any given 0 < ε ⩽ 1. In all cases, the size of the computed coreset is $O(k/\varepsilon^d + z)$.
• In the MPC model the data are distributed over m machines. One is the coordinator machine, which will contain the final answer; the others are worker machines. We present a deterministic 2-round algorithm using $O(\sqrt{n})$ machines, where the worker machines have $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z + 1))$ local memory, and the coordinator has $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z + 1) + z)$ local memory. The algorithm can handle point sets P that are distributed arbitrarily (possibly adversarially) over the machines. We also present a randomized algorithm that uses only a single round, under the assumption that the input set P is initially distributed randomly over the machines. Then we present a deterministic algorithm that obtains a trade-off between the number of rounds, R, and the storage per machine.
• In the streaming model we have a single machine with limited storage, and P is revealed in a streaming fashion.
○ We present the first lower bound for the insertion-only streaming model, where the points arrive one by one and no points are deleted. We show that any deterministic algorithm that maintains an (ε, k, z)-coreset must use $\Omega(k/\varepsilon^d + z)$ space. We complement this with a deterministic streaming algorithm using $O(k/\varepsilon^d + z)$ space, which is thus optimal.
○ For fully dynamic data streams, where points can be inserted as well as deleted, we give a randomized algorithm for point sets from a d-dimensional discrete Euclidean space $[\Delta]^d$, where Δ ∈ ℕ indicates the size of the universe from which the coordinates are taken. Our algorithm uses only $O((k/\varepsilon^d + z)\log^4(k\Delta/(\varepsilon\delta)))$ space, and it is the first algorithm for this setting. We also present an $\Omega((k/\varepsilon^d)\log\Delta + z)$ lower bound for deterministic fully dynamic streaming algorithms.
○ For the sliding-window model, we show that any deterministic streaming algorithm that guarantees a (1 + ε)-approximation for the k-center problem with outliers in $\mathbb{R}^d$ must use $\Omega((kz/\varepsilon^d)\log\sigma)$ space, where σ is the ratio of the largest to the smallest distance between any two points in the stream. This (negatively) answers a question posed by De Berg, Monemizadeh, and Zhong [1].
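
To make the objective defined above concrete, the following is a minimal sketch in Python of how the cost of a candidate center set could be evaluated on a (possibly weighted) point set, where a weight is read as a multiplicity. It illustrates only the problem definition, not the coreset constructions of the paper. The names `outlier_cost` and `brute_force_centers` are hypothetical, points are assumed to lie in Euclidean space, and the brute-force search assumes that centers are restricted to input points, whereas the problem statement allows any center in X.

```python
from itertools import combinations
from math import dist


def outlier_cost(points, centers, z, weights=None):
    """Smallest radius r such that the points farther than r from their
    nearest center have total weight at most z (hypothetical helper)."""
    if weights is None:
        weights = [1] * len(points)  # unweighted input: every point counts once
    # Distance from each point to its nearest center, farthest first.
    nearest = sorted(
        ((min(dist(p, c) for c in centers), w) for p, w in zip(points, weights)),
        reverse=True,
    )
    discarded = 0
    for d, w in nearest:
        if discarded + w > z:
            return d  # this point cannot be discarded, so it fixes the radius
        discarded += w
    return 0.0  # every point fits within the outlier budget


def brute_force_centers(points, k, z, weights=None):
    """Try every k-subset of the input as centers. Assumption: centers are
    restricted to input points. Exponential time, for tiny inputs only."""
    return min(
        combinations(points, k),
        key=lambda cs: outlier_cost(points, cs, z, weights),
    )


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0), (100, 0)]
    best = brute_force_centers(pts, k=2, z=1)
    print(best, outlier_cost(pts, best, z=1))
```

In this small example the far point (100, 0) is treated as the single outlier, so two centers suffice to cover the remaining four points. Passing a coreset together with its weights to the same functions would, under the definition above, yield a radius within a (1 ± ε) factor of the radius on the full input.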