Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL

Marcel Breyer, Gregor Daiß, Dirk Pflüger
{"title":"Performance-Portable Distributed k-Nearest Neighbors using Locality-Sensitive Hashing and SYCL","authors":"Marcel Breyer, Gregor Daiß, D. Pflüger","doi":"10.1145/3456669.3456692","DOIUrl":null,"url":null,"abstract":"In the age of data collection, machine learning algorithms have to be able to efficiently cope with vast data sets. This requires scalable algorithms and efficient implementations that can cope with heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones. However, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. Such an algorithm is the Locality-Sensitive Hashing (LSH) algorithm, which uses hash tables together with locality-sensitive hash functions to reduce the data points that have to be examined to compute the k-NN. To the best of our knowledge, there is no distributed LSH version supporting multiple GPUs from different vendors available so far despite the fact that k-NNs are frequently employed. Therefore, we have developed the library. It provides the first hardware-independent, yet efficient and distributed implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL 1.2.1, which is an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs using the Message Passing Interface (MPI) to enable the usage of both shared and distributed memory systems. We have tested different parameter combinations for two locality-sensitive hash function implementations, which we compare. Our results show that our library can easily scale on multiple GPUs using both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the library supports different SYCL implementations—ComputeCpp, hipSYCL, and DPC++—to target different hardware architectures without significant performance differences.","PeriodicalId":73497,"journal":{"name":"International Workshop on OpenCL","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3456669.3456692","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

In the age of data collection, machine learning algorithms have to cope efficiently with vast data sets. This requires scalable algorithms and efficient implementations that can handle heterogeneous hardware. We propose a new, performance-portable implementation of a well-known, robust, and versatile multi-class classification method that supports multiple Graphics Processing Units (GPUs) from different vendors. It is based on a performance-portable implementation of the approximate k-nearest neighbors (k-NN) algorithm in SYCL. The k-NN assigns a class to a data point based on a majority vote of its neighborhood. The naive approach compares a data point x to all other data points in the training data to identify the k nearest ones. However, this has quadratic runtime and is infeasible for large data sets. Therefore, approximate variants have been developed. One such algorithm is Locality-Sensitive Hashing (LSH), which uses hash tables together with locality-sensitive hash functions to reduce the number of data points that have to be examined to compute the k-NN. To the best of our knowledge, no distributed LSH implementation supporting multiple GPUs from different vendors has been available so far, even though k-NNs are frequently employed. Therefore, we have developed our library: the first hardware-independent, yet efficient and distributed implementation of the LSH algorithm that is suited for modern supercomputers. The implementation uses C++17 together with SYCL 1.2.1, an abstraction layer for OpenCL that allows targeting different hardware with a single implementation. To support large data sets, we utilize multiple GPUs via the Message Passing Interface (MPI), enabling the use of both shared and distributed memory systems. We have tested and compared different parameter combinations for two locality-sensitive hash function implementations. Our results show that our library scales easily on multiple GPUs with both hash function types, achieving a nearly optimal parallel speedup of up to 7.6 on 8 GPUs. Furthermore, we demonstrate that the library supports different SYCL implementations (ComputeCpp, hipSYCL, and DPC++) to target different hardware architectures without significant performance differences.
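To make the LSH idea concrete, here is a minimal, self-contained sketch of one classic locality-sensitive hash family for Euclidean distance, random projections of the form h(x) = floor((a·x + b) / w). The paper compares two hash function implementations; the names and structure below (random_projection_hash, make_hash, table_key) are illustrative assumptions, not the library's actual API.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One random-projection hash h(x) = floor((a . x + b) / w): points that are
// close in Euclidean distance are likely to receive the same value, points
// that are far apart are likely to receive different ones.
struct random_projection_hash {
    std::vector<float> a;  // random direction, entries drawn from N(0, 1)
    float b;               // random offset, drawn uniformly from [0, w)
    float w;               // bucket width (a tuning parameter)

    std::uint32_t operator()(const std::vector<float>& x) const {
        float dot = 0.0f;
        for (std::size_t d = 0; d < x.size(); ++d) {
            dot += a[d] * x[d];
        }
        const auto bucket = static_cast<std::int64_t>(std::floor((dot + b) / w));
        return static_cast<std::uint32_t>(bucket);  // wrap-around is fine for a key
    }
};

// Draw one hash function for data of dimension dim.
random_projection_hash make_hash(std::size_t dim, float w, std::mt19937& gen) {
    std::normal_distribution<float> normal{0.0f, 1.0f};
    std::uniform_real_distribution<float> uniform{0.0f, w};
    random_projection_hash h{std::vector<float>(dim), uniform(gen), w};
    for (auto& ai : h.a) { ai = normal(gen); }
    return h;
}

// Combine several hash functions into a single hash-table key so that only
// points agreeing on all of them end up in the same bucket.
std::uint32_t table_key(const std::vector<random_projection_hash>& hs,
                        const std::vector<float>& x, std::uint32_t num_buckets) {
    std::uint32_t key = 0;
    for (const auto& h : hs) {
        key = key * 31u + h(x);  // simple polynomial mixing
    }
    return key % num_buckets;
}
```

Training points whose combined key matches a query's key form the candidate set, so only a small fraction of the data has to be compared exactly; this is what breaks the quadratic runtime of the naive k-NN.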
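The exact distance computations on such a candidate set are precisely the kind of data-parallel work SYCL offloads to a GPU. Below is a minimal sketch in the SYCL 1.2.1 style referenced in the abstract (cl::sycl namespace, buffers, accessors, parallel_for); the function name candidate_distances, the row-major layout, and the kernel name are assumptions for illustration, not the library's actual kernels.

```cpp
#include <CL/sycl.hpp>
#include <cstddef>
#include <vector>
namespace sycl = cl::sycl;

// Squared Euclidean distances from one query point to num_candidates points of
// dimension dim, stored flattened in row-major order.
std::vector<float> candidate_distances(const std::vector<float>& candidates,
                                       const std::vector<float>& query,
                                       std::size_t num_candidates,
                                       std::size_t dim) {
    std::vector<float> dists(num_candidates);
    {
        sycl::queue q{sycl::default_selector{}};
        sycl::buffer<float, 1> cand_buf{candidates.data(), sycl::range<1>{candidates.size()}};
        sycl::buffer<float, 1> query_buf{query.data(), sycl::range<1>{dim}};
        sycl::buffer<float, 1> dist_buf{dists.data(), sycl::range<1>{num_candidates}};

        q.submit([&](sycl::handler& cgh) {
            auto cand = cand_buf.get_access<sycl::access::mode::read>(cgh);
            auto qpt  = query_buf.get_access<sycl::access::mode::read>(cgh);
            auto dist = dist_buf.get_access<sycl::access::mode::discard_write>(cgh);
            // One work-item per candidate point.
            cgh.parallel_for<class distance_kernel>(
                sycl::range<1>{num_candidates}, [=](sycl::id<1> idx) {
                    const std::size_t i = idx[0];
                    float sum = 0.0f;
                    for (std::size_t d = 0; d < dim; ++d) {
                        const float diff = cand[i * dim + d] - qpt[d];
                        sum += diff * diff;
                    }
                    dist[i] = sum;
                });
        });
    }  // buffer destructors synchronize and copy results back into dists
    return dists;
}
```

Because this code uses only the SYCL 1.2.1 interface, the same source can be compiled with ComputeCpp, hipSYCL, or DPC++ and run on GPUs from different vendors, which is the portability argument the abstract makes.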
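Finally, a hedged sketch of how MPI can distribute the data set across ranks, one GPU per rank. The abstract states that MPI is used for multi-GPU support; the block decomposition and the exchange step indicated here are one plausible scheme under that statement, not necessarily the paper's.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstddef>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Hypothetical global problem size; in practice this comes from the data set.
    const std::size_t total_points = 1'000'000;

    // Each rank (driving one GPU) owns a contiguous block of the data.
    const std::size_t chunk = (total_points + size - 1) / size;
    const std::size_t begin = rank * chunk;
    const std::size_t end   = std::min(begin + chunk, total_points);

    // ... load points [begin, end), build the local hash tables, and run the
    // SYCL kernels on this rank's GPU; candidate k-NN results for points owned
    // by other ranks would then be exchanged, e.g. with MPI_Alltoallv.
    (void)end;

    MPI_Finalize();
    return 0;
}
```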