Spectral clustering is one of the most popular clustering techniques in statistical inference. When applied to large-scale datasets, distributed spectral clustering typically faces two major challenges. First, distributed storage may disrupt the original network structure. Second, exchanging data among the computers in a distributed system incurs high communication costs. In this work, we propose a communication-efficient algorithm for distributed spectral clustering. Our motivation stems from a theoretical comparison between spectral clustering on the entire dataset (global spectral clustering) and on a subsample (local spectral clustering), in which we analyze the key factors underlying their performance differences. Based on this comparison, we propose a communication-efficient distributed spectral clustering (CEDSC) method, which iteratively aggregates intermediate outputs from local spectral clustering to approximate the corresponding global quantity. In this process, only low-dimensional vectors are exchanged between computers, and we show that the resulting procedure is communication-efficient. Simulation studies and real-data applications show that CEDSC attains higher clustering accuracy than existing distributed spectral clustering methods while using only modest communication. When clustering 10,000 objects, CEDSC improves clustering accuracy by about 37% over the best baseline, while its communication time stays below 0.4 seconds, comparable to that of the most communication-efficient method.
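To make the idea of exchanging only low-dimensional vectors concrete, the following is a minimal sketch of a generic distributed orthogonal (subspace) iteration. It is not the CEDSC algorithm, whose details are not given in this abstract. The sketch assumes that each worker stores a block of rows of the adjacency matrix, that a coordinator broadcasts an n x k basis each round, and that workers return only their slice of the product, so each round moves on the order of n*k numbers rather than n*n. All function names, the toy two-clique graph, and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def make_two_block_graph(n_per_block=6, bridges=((0, 6), (1, 7))):
    """Deterministic toy graph (assumption): two cliques joined by a few bridge edges."""
    n = 2 * n_per_block
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (i < n_per_block) == (j < n_per_block):
                adj[i][j] = 1.0
    for i, j in bridges:
        adj[i][j] = adj[j][i] = 1.0
    return adj


def local_multiply(row_block, basis):
    """Worker step: multiply the locally stored rows by the shared n x k basis."""
    k = len(basis[0])
    return [
        [sum(row[j] * basis[j][c] for j in range(len(row))) for c in range(k)]
        for row in row_block
    ]


def orthonormalize(matrix):
    """Coordinator step: Gram-Schmidt on the columns of an n x k matrix."""
    n, k = len(matrix), len(matrix[0])
    cols = [[matrix[i][c] for i in range(n)] for c in range(k)]
    for c in range(k):
        for p in range(c):
            dot = sum(a * b for a, b in zip(cols[c], cols[p]))
            cols[c] = [a - dot * b for a, b in zip(cols[c], cols[p])]
        norm = math.sqrt(sum(a * a for a in cols[c]))
        cols[c] = [a / norm for a in cols[c]]
    return [[cols[c][i] for c in range(k)] for i in range(n)]


def distributed_spectral_labels(adj, num_workers=3, rounds=50):
    """Two-cluster labels from the sign pattern of the second basis vector."""
    n = len(adj)
    size = math.ceil(n / num_workers)
    # Each worker holds a contiguous block of rows (storage assumption).
    blocks = [adj[s:s + size] for s in range(0, n, size)]
    # Fixed, non-degenerate starting basis with k = 2 columns.
    basis = orthonormalize([[1.0, float(i)] for i in range(n)])
    for _ in range(rounds):
        # Only n_k x 2 slices travel back to the coordinator each round.
        pieces = [local_multiply(block, basis) for block in blocks]
        stacked = [row for piece in pieces for row in piece]
        basis = orthonormalize(stacked)
    # Resolve the sign ambiguity by assigning node 0 to cluster 0.
    reference_sign = basis[0][1] >= 0
    return [0 if (row[1] >= 0) == reference_sign else 1 for row in basis]


labels = distributed_spectral_labels(make_two_block_graph())
```

On this toy graph the two cliques should end up in different clusters. A real distributed setting would replace the in-process loop over `blocks` with network calls, and a method such as CEDSC may aggregate different local quantities than the ones used here.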