SelB-k-NN: A Mini-Batch K-Nearest Neighbors Algorithm on AI Processors

Yifeng Tang, Cho-Li Wang
{"title":"SelB-k-NN: A Mini-Batch K-Nearest Neighbors Algorithm on AI Processors","authors":"Yifeng Tang, Cho-Li Wang","doi":"10.1109/IPDPS54959.2023.00088","DOIUrl":null,"url":null,"abstract":"The popularity of Artificial Intelligence (AI) motivates novel domain-specific hardware named AI processors. With a design trade-off, the AI processors feature incredible computation power for matrix multiplications and activations, while some leave other operations less powerful, e.g., scalar operations and vectorized comparisons & selections. For k-nearest neighbors (k-NN) algorithm, consisting of distance computation phase and k-selection phase, while the former is naturally accelerated, the previous efficient k-selection becomes problematic. Moreover, limited memory forces k-NN to adopt a mini-batch manner with tiling technique. As the distance computation’s results are the k-selection’s inputs, the former’s tiling shape determines that of the latter. Since the two phases execute on separate hardware units requiring different performance analyses, whether the former’s tiling strategies benefit the latter and entire k-NN is doubtful.To address the new challenges brought by the AI processors, this paper proposes SelB-k-NN (Selection-Bitonic-k-NN), a mini-batch algorithm inspired by selection sort and bitonic k-selection. SelB-k-NN avoids the expansion of the weakly-supported operations on the huge scale of datasets. To apply SelB-k-NN to various AI processors, we propose two algorithms to reduce the hardware support requirements. Since the matrix multiplication operates data with the specifically-designed memory hierarchy which k-selection does not share, the tiling shape of the former cannot guarantee the best execution of the latter and vice versa. By quantifying the runtime workload variations of k-selection, we formulate an optimization problem to search for the optimal tiling shapes of both phases with an offline pruning method, which reduces the search space in the preprocessing stage. Evaluations show that on Huawei Ascend 310 AI processor, SelB-k-NN achieves 2.01× speedup of the bitonic k-selection, 23.93× of the heap approach, 78.52× of the CPU approach. For mini-batch SelB-k-NN, the optimal tiling shapes for two phases respectively achieve 1.48× acceleration compared with the matrix multiplication tiling shapes and 1.14× with the k-selection tiling shapes, with 72.80% of the search space pruned.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00088","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The popularity of Artificial Intelligence (AI) motivates novel domain-specific hardware called AI processors. As a design trade-off, AI processors offer enormous computing power for matrix multiplications and activations, while leaving other operations, such as scalar arithmetic and vectorized comparisons and selections, comparatively weak. The k-nearest neighbors (k-NN) algorithm consists of a distance computation phase and a k-selection phase: the former is naturally accelerated on such hardware, but previously efficient k-selection methods become problematic. Moreover, limited memory forces k-NN to run in a mini-batch manner using a tiling technique. Because the distance computation's results are the k-selection's inputs, the former's tiling shape determines the latter's. Since the two phases execute on separate hardware units requiring different performance analyses, it is doubtful whether a tiling strategy chosen for distance computation benefits k-selection, or k-NN as a whole.

To address these new challenges brought by AI processors, this paper proposes SelB-k-NN (Selection-Bitonic-k-NN), a mini-batch algorithm inspired by selection sort and bitonic k-selection. SelB-k-NN avoids expanding the weakly supported operations over huge datasets. To make SelB-k-NN applicable to various AI processors, we propose two algorithms that reduce its hardware-support requirements. Because matrix multiplication operates on data through a specially designed memory hierarchy that k-selection does not share, the tiling shape that is optimal for one phase cannot guarantee the best execution of the other, and vice versa. By quantifying the runtime workload variations of k-selection, we formulate an optimization problem that searches for the optimal tiling shapes of both phases, using an offline pruning method that shrinks the search space in a preprocessing stage. Evaluations on the Huawei Ascend 310 AI processor show that SelB-k-NN achieves a 2.01× speedup over bitonic k-selection, 23.93× over the heap approach, and 78.52× over the CPU approach. For mini-batch SelB-k-NN, the jointly optimized tiling shapes for the two phases achieve 1.48× acceleration compared with the matrix-multiplication tiling shapes and 1.14× compared with the k-selection tiling shapes, with 72.80% of the search space pruned.
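To make the two-phase structure described above concrete, the sketch below shows a generic mini-batch k-NN in NumPy: distances are computed one corpus tile at a time through a single matrix multiplication (the operation AI processors accelerate best), and each tile's candidates are merged into a running top-k. This is a minimal illustration under stated assumptions, not the paper's SelB-k-NN implementation: squared Euclidean distance is assumed, np.argpartition stands in for the selection-sort/bitonic k-selection hybrid, and the function name knn_minibatch and the tile parameter are invented for the example.

```python
# Minimal mini-batch k-NN sketch (NOT the paper's SelB-k-NN):
# a matmul-heavy distance phase followed by a k-selection phase.
import numpy as np

def knn_minibatch(queries, corpus, k, tile=1024):
    """Return indices of the k nearest corpus points for each query.

    queries: (m, d) array; corpus: (n, d) array.
    Squared Euclidean distances are computed tile by tile via
    ||q - c||^2 = ||q||^2 - 2 q.c + ||c||^2, so the dominant cost
    is one matrix multiplication per tile.
    """
    m, n = queries.shape[0], corpus.shape[0]
    q_sq = np.sum(queries ** 2, axis=1, keepdims=True)   # (m, 1)
    best_d = np.full((m, k), np.inf)                     # running top-k distances
    best_i = np.full((m, k), -1, dtype=np.int64)         # running top-k indices

    for start in range(0, n, tile):                      # mini-batch over the corpus
        block = corpus[start:start + tile]               # (t, d)
        c_sq = np.sum(block ** 2, axis=1)                # (t,)

        # Distance computation phase: one matmul per tile.
        dist = q_sq - 2.0 * queries @ block.T + c_sq     # (m, t)

        # k-selection phase: merge this tile's candidates with the
        # running top-k and keep the k smallest (np.argpartition is a
        # stand-in for the paper's selection/bitonic k-selection).
        idx = np.arange(start, start + block.shape[0])
        cand_d = np.concatenate([best_d, dist], axis=1)
        cand_i = np.concatenate([best_i, idx[None, :].repeat(m, axis=0)], axis=1)
        keep = np.argpartition(cand_d, k - 1, axis=1)[:, :k]
        best_d = np.take_along_axis(cand_d, keep, axis=1)
        best_i = np.take_along_axis(cand_i, keep, axis=1)
    return best_i
```

Note how the tile size fixes the shape of dist, which is simultaneously the matrix multiplication's output and the k-selection's input; this coupling is exactly why the paper treats the tiling shapes of the two phases as a joint optimization problem rather than tuning the matmul tiling alone.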