An optimized tensor completion library for multiple GPUs

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing Pub Date : 2021-06-03 DOI:10.1145/3447818.3460692

Ming Dun, Yunchun Li, Hailong Yang, Qingxiao Sun, Zhongzhi Luan, D. Qian

{"title":"An optimized tensor completion library for multiple GPUs","authors":"Ming Dun, Yunchun Li, Hailong Yang, Qingxiao Sun, Zhongzhi Luan, D. Qian","doi":"10.1145/3447818.3460692","DOIUrl":null,"url":null,"abstract":"Tensor computations are gaining wide adoption in big data analysis and artificial intelligence. Among them, tensor completion is used to predict the missing or unobserved value in tensors. The decomposition-based tensor completion algorithms have attracted significant research attention since they exhibit better parallelization and scalability. However, existing optimization techniques for tensor completion cannot sustain the increasing demand for applying tensor completion on ever larger tensor data. To address the above limitations, we develop the first tensor completion library cuTC on multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms such as alternating least squares (ALS), stochastic gradient descent (SGD) and coordinate descent (CCD+). We propose a novel TB-COO format that leverages warp shuffle and shared memory on GPU to enable efficient reduction. In addition, we adopt the auto-tuning method to determine the optimal parameters for better convergence and performance. We compare cuTC with state-of-the-art tensor completion libraries on real-world datasets, and the results show cuTC achieves significant speedup with similar or even better accuracy.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3447818.3460692","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

Tensor computations are gaining wide adoption in big data analysis and artificial intelligence. Among them, tensor completion is used to predict the missing or unobserved value in tensors. The decomposition-based tensor completion algorithms have attracted significant research attention since they exhibit better parallelization and scalability. However, existing optimization techniques for tensor completion cannot sustain the increasing demand for applying tensor completion on ever larger tensor data. To address the above limitations, we develop the first tensor completion library cuTC on multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms such as alternating least squares (ALS), stochastic gradient descent (SGD) and coordinate descent (CCD+). We propose a novel TB-COO format that leverages warp shuffle and shared memory on GPU to enable efficient reduction. In addition, we adopt the auto-tuning method to determine the optimal parameters for better convergence and performance. We compare cuTC with state-of-the-art tensor completion libraries on real-world datasets, and the results show cuTC achieves significant speedup with similar or even better accuracy.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

一个优化的张量补全库为多个gpu

张量计算在大数据分析和人工智能中得到了广泛的应用。其中，张量补全用于预测张量中的缺失值或未观测值。基于分解的张量补全算法由于具有更好的并行性和可扩展性而引起了广泛的研究关注。然而，现有的张量补全优化技术无法满足在越来越大的张量数据上应用张量补全的需求。为了解决上述限制，我们在多个图形处理单元(gpu)上开发了第一个张量补全库cuTC，其中包括三种广泛使用的优化算法，如交替最小二乘(ALS)，随机梯度下降(SGD)和坐标下降(CCD+)。我们提出了一种新的TB-COO格式，它利用GPU上的warp shuffle和共享内存来实现有效的缩减。此外，我们采用自调谐方法来确定最优参数，以获得更好的收敛性和性能。我们将cuTC与现实世界数据集上最先进的张量补全库进行了比较，结果表明cuTC在相似甚至更好的精度下实现了显著的加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

自引率

0.00%

发文量

期刊最新文献

Accelerating BWA-MEM Read Mapping on GPUs. Dynamic Memory Management in Massively Parallel Systems: A Case on GPUs. Priority Algorithms with Advice for Disjoint Path Allocation Problems From Data of Internet of Things to Domain Knowledge: A Case Study of Exploration in Smart Agriculture On Two Variants of Induced Matchings