EGG-SynC: Exact GPU-parallelized Grid-based Clustering by Synchronization

Advances in Database Technology: Proceedings. International Conference on Extending Database Technology. Pub Date: 2023-01-01. DOI: 10.48786/edbt.2023.16

Jakob Rødsgaard Jørgensen, I. Assent

Volume: 27 1. Pages: 195-207.

Citations: 1

Abstract

Clustering by synchronization (SynC) is a clustering method that is motivated by the natural phenomenon of synchronization and is based on the Kuramoto model. The idea is to iteratively drag similar objects closer to each other until they have synchronized. SynC has been adapted to solve several well-known data mining tasks such as subspace clustering, hierarchical clustering, and streaming clustering, which shows that the SynC model is very versatile. Unfortunately, SynC has $O(T \times n^2 \times d)$ complexity, which makes it impractical for larger datasets. For example, Chen et al. [8] report runtimes of more than 10 hours for just $n = 70{,}000$ data points, and improve this to just above one hour by using R-Trees in their method FSynC. Both are still impractical in real-life scenarios. Furthermore, SynC uses a termination criterion that gives no guarantee that the points have synchronized; it simply stops when most points are close to synchronizing. In this paper, our contributions are manifold. We propose a new termination criterion that guarantees that all points have synchronized. To achieve a much-needed reduction in runtime, we propose a strategy for summarizing partitions of the data in a grid structure, a GPU-friendly grid structure that supports both this summarization and neighborhood queries, and a GPU-parallelized algorithm for clustering by synchronization (EGG-SynC) that utilizes these ideas. Furthermore, we provide an extensive evaluation against the state of the art, showing a speedup of 2 to 3 orders of magnitude compared to SynC and FSynC.
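To make the "drag similar objects closer" idea concrete, here is a minimal sketch of a naive synchronization loop. The abstract does not spell out the update rule, so the sketch assumes the Kuramoto-style rule commonly given for SynC: each point moves by the mean of sin(q - p), per dimension, over its eps-neighborhood. The names `sync_step`, `naive_sync`, `eps`, and `iterations` are illustrative only, and the fixed iteration count stands in for a termination criterion. It is neither SynC's original criterion nor the one proposed in the paper, and the sketch uses no grid or GPU.

```python
import math

def sync_step(points, eps):
    """One naive synchronization step over all points: O(n^2 * d) work."""
    updated = []
    for p in points:
        # eps-neighborhood of p; it contains p itself, so it is never empty.
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        # Assumed Kuramoto-style pull: shift each coordinate of p by the
        # mean of sin(q[k] - p[k]) over the neighborhood.
        updated.append([
            pk + sum(math.sin(q[k] - pk) for q in neighbors) / len(neighbors)
            for k, pk in enumerate(p)
        ])
    return updated

def naive_sync(points, eps, iterations):
    """Run a fixed number T of steps (a placeholder for a real stopping rule)."""
    for _ in range(iterations):
        points = sync_step(points, eps)
    return points

# Two well-separated groups of 2-D points (made-up toy data).
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
result = naive_sync(data, eps=0.5, iterations=50)
```

Every step compares all pairs of points in all d dimensions, which is where the $O(T \times n^2 \times d)$ cost comes from. Points that end up at the same location can then be read off as one cluster.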



