Distributed Clustering based on Distributional Kernel
Hang Zhang, Yang Xu, Lei Gong, Ye Zhu, Kai Ming Ting
arXiv - STAT - Machine Learning, 2024-09-14. DOI: arxiv-2409.09418
Citations: 0
Abstract
This paper introduces a new framework for clustering in a distributed network, called Distributed Clustering based on Distributional Kernel (K), or KDC, which produces the final clusters based on the similarity between the distributions of the initial clusters, as measured by K. It is the only framework that satisfies all three of the following properties. First, KDC guarantees that the combined clustering outcome from all sites is equivalent to the clustering outcome of its centralized counterpart on the combined dataset from all sites. Second, the maximum runtime cost of any site in distributed mode is smaller than the runtime cost in centralized mode. Third, it is designed to discover clusters of arbitrary shapes, sizes, and densities. To the best of our knowledge, this is the first distributed clustering framework that employs a distributional kernel. The distribution-based clustering leads directly to significantly better clustering outcomes than existing distributed clustering methods. In addition, we introduce a new clustering algorithm called Kernel Bounded Cluster Cores, which performs best among existing clustering algorithms when applied within KDC. We also show that KDC is a generic framework that enables a quadratic-time clustering algorithm to handle large datasets that would otherwise be infeasible.
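The central idea described above — merging initial clusters whose underlying distributions are similar under a distributional kernel K — can be sketched as follows. The paper does not specify its kernel here, so this illustration substitutes a Gaussian kernel mean-embedding similarity as a stand-in for K; the function names, the similarity threshold, and the union-find merging step are all hypothetical choices for illustration, not the paper's method.

```python
import numpy as np

def mean_map_similarity(X, Y, gamma=1.0):
    """Stand-in for a distributional kernel K(P, Q): the average Gaussian
    kernel value over all point pairs, i.e. a kernel mean-embedding similarity.
    X, Y: (n, d) and (m, d) arrays of points from two initial clusters."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists).mean()

def merge_initial_clusters(clusters, threshold=0.5, gamma=1.0):
    """Merge initial clusters (e.g. gathered from different sites) whenever
    their distributional similarity exceeds a threshold, using union-find.
    Returns a final-cluster label for each initial cluster."""
    n = len(clusters)
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if mean_map_similarity(clusters[i], clusters[j], gamma) >= threshold:
                parent[find(j)] = find(i)

    roots = [find(i) for i in range(n)]
    remap = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [remap[r] for r in roots]
```

In this sketch, two nearby initial clusters are assigned the same final label while a distant one stays separate; the actual framework's guarantee of equivalence to centralized clustering depends on properties of its kernel K that a generic Gaussian mean embedding does not provide.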