Pub Date : 2024-07-02DOI: 10.1016/j.parco.2024.103093
Alexander Agathos , Philip Azariadis
The k-Nearest Neighbors algorithm is a fundamental algorithm that finds applications in many fields like Machine Learning, Computer Graphics, Computer Vision, and others. The algorithm determines the closest points (d-dimensional) of a reference set R according to a query set of points Q under a specific metric (Euclidean, Mahalanobis, Manhattan, etc.). This work focuses on the utilization of multiple Graphical Processing Units for the acceleration of the k-Nearest Neighbors algorithm with large or very large sets of 3D points. With the proposed approach the space of the reference set is divided into a 3D grid which is used to facilitate the search for the nearest neighbors. The search in the grid is performed in a multiresolution manner starting from a high-resolution grid and ending up in a coarse one, thus accounting for point clouds that may have non-uniform sampling and/or outliers. Three important algorithms in reverse engineering are revisited and new multi-GPU versions are proposed based on the introduced KNN algorithm. More specifically, the new multi-GPU approach is applied to the Iterative Closest Point algorithm, to the point cloud smoothing, and to the point cloud normal vectors computation and orientation problem. A series of tests and experiments have been conducted and discussed in the paper showing the merits of the proposed multi-GPU approach.
k 近邻算法是一种基本算法,在机器学习、计算机图形学、计算机视觉等许多领域都有应用。该算法根据特定度量(欧氏、马哈罗诺比、曼哈顿等)下的查询点集合 Q,确定参考集合 R 的最近点(d 维)。这项工作的重点是利用多个图形处理单元来加速大型或超大型三维点集的 k 近邻算法。利用所提出的方法,参考集的空间被划分为一个三维网格,用于促进近邻搜索。网格中的搜索是以多分辨率方式进行的,从高分辨率网格开始,到粗网格结束,从而考虑到可能存在非均匀采样和/或异常值的点云。我们重新审视了逆向工程中的三种重要算法,并基于引入的 KNN 算法提出了新的多 GPU 版本。更具体地说,新的多 GPU 方法适用于迭代最接近点算法、点云平滑以及点云法向量计算和定向问题。文中进行了一系列测试和实验,并讨论了所提出的多 GPU 方法的优点。
{"title":"Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation","authors":"Alexander Agathos , Philip Azariadis","doi":"10.1016/j.parco.2024.103093","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103093","url":null,"abstract":"<div><p>The k-Nearest Neighbors algorithm is a fundamental algorithm that finds applications in many fields like Machine Learning, Computer Graphics, Computer Vision, and others. The algorithm determines the closest points (d-dimensional) of a reference set R according to a query set of points Q under a specific metric (Euclidean, Mahalanobis, Manhattan, etc.). This work focuses on the utilization of multiple Graphical Processing Units for the acceleration of the k-Nearest Neighbors algorithm with large or very large sets of 3D points. With the proposed approach the space of the reference set is divided into a 3D grid which is used to facilitate the search for the nearest neighbors. The search in the grid is performed in a multiresolution manner starting from a high-resolution grid and ending up in a coarse one, thus accounting for point clouds that may have non-uniform sampling and/or outliers. Three important algorithms in reverse engineering are revisited and new multi-GPU versions are proposed based on the introduced KNN algorithm. More specifically, the new multi-GPU approach is applied to the Iterative Closest Point algorithm, to the point cloud smoothing, and to the point cloud normal vectors computation and orientation problem. A series of tests and experiments have been conducted and discussed in the paper showing the merits of the proposed multi-GPU approach.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103093"},"PeriodicalIF":2.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-29DOI: 10.1016/j.parco.2024.103092
Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You
Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.
{"title":"WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel","authors":"Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You","doi":"10.1016/j.parco.2024.103092","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103092","url":null,"abstract":"<div><p>Parameter server is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is only updated when the slowest worker uploads the gradients, ensuring the consistency of global models that are pulled down by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version to further reduce the communication overhead. It enables parallel execution of communication and computation tasks on workers to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism can reduce the required time to achieve the target accuracy by up to 60% compared with the fastest method and increase the proportion of computation time from 55%–72% in existing methods to 91%.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"121 ","pages":"Article 103092"},"PeriodicalIF":2.0,"publicationDate":"2024-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141607939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-04DOI: 10.1016/j.parco.2024.103085
Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang
First-principles time-dependent density functional theory (TDDFT) is a powerful tool to accurately describe the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback in TDDFT calculations is its ultrahigh computational cost and large memory usage especially for plane-wave basis set, confining its applications to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate LR-TDDFT in two different aspects: (1) numerical algorithms on the X86 supercomputer and (2) optimizations on the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of different computation steps. By utilizing these two different methods, our implementation can gain an overall speedup of 10x and 80x and efficiently scales to large systems up to 4096 and 2744 atoms within dozens of seconds.
{"title":"Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer","authors":"Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang","doi":"10.1016/j.parco.2024.103085","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103085","url":null,"abstract":"<div><p>First-principles time-dependent density functional theory (TDDFT) is a powerful tool to accurately describe the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback in TDDFT calculations is its ultrahigh computational cost <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>5</mn></mrow></msup><mo>∼</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>6</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> and large memory usage <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>4</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> especially for plane-wave basis set, confining its applications to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate LR-TDDFT in two different aspects: (1) numerical algorithms on the X86 supercomputer and (2) optimizations on the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of different computation steps. By utilizing these two different methods, our implementation can gain an overall speedup of 10x and 80x and efficiently scales to large systems up to 4096 and 2744 atoms within dozens of seconds.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103085"},"PeriodicalIF":1.4,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140894775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-26DOI: 10.1016/j.parco.2024.103084
Fahimeh Yazdanpanah, Mohammad Alaei
PSO (particle swarm optimization), is an intelligent search method for finding the best solution according to population state. Various parallel implementations of this algorithm have been presented for intensive-computing applications. The ALC-PSO algorithm (PSO with an aging leader and challengers) is an improved population-based procedure that increases convergence rapidity, compared to the traditional PSO. In this paper, we propose a low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA, for execution on both CPU and GPU cores. This is the first effort to heterogeneous parallel implementing ALC-PSO algorithm with combination of OmpSs and CUDA. This hybrid parallel programming approach increases the performance and efficiency of the intensive-computing applications. The proposed approach of this article is also useful and applicable for heterogeneous parallel execution of the other improved versions of PSO algorithm, on both CPUs and GPUs. The results demonstrate that the proposed approach provides higher performance, in terms of delay and power consumption, than the existence implementations of ALC-PSO algorithm.
PSO(粒子群优化)是一种智能搜索方法,可根据种群状态找到最佳解决方案。针对密集型计算应用,该算法有多种并行实施方案。与传统的 PSO 相比,ALC-PSO 算法(带有老化领导者和挑战者的 PSO)是一种基于种群的改进程序,可提高收敛速度。在本文中,我们提出了一种使用 OmpSs 和 CUDA 的 ALC-PSO 算法的低功耗异构并行实施方案,可在 CPU 和 GPU 内核上执行。这是首次使用 OmpSs 和 CUDA 对 ALC-PSO 算法进行异构并行计算。这种混合并行编程方法提高了密集型计算应用的性能和效率。本文提出的方法也适用于在 CPU 和 GPU 上异构并行执行其他改进版本的 PSO 算法。结果表明,与 ALC-PSO 算法的现有实现相比,本文提出的方法在延迟和功耗方面提供了更高的性能。
{"title":"An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA","authors":"Fahimeh Yazdanpanah, Mohammad Alaei","doi":"10.1016/j.parco.2024.103084","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103084","url":null,"abstract":"<div><p>PSO (particle swarm optimization), is an intelligent search method for finding the best solution according to population state. Various parallel implementations of this algorithm have been presented for intensive-computing applications. The ALC-PSO algorithm (PSO with an aging leader and challengers) is an improved population-based procedure that increases convergence rapidity, compared to the traditional PSO. In this paper, we propose a low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA, for execution on both CPU and GPU cores. This is the first effort to heterogeneous parallel implementing ALC-PSO algorithm with combination of OmpSs and CUDA. This hybrid parallel programming approach increases the performance and efficiency of the intensive-computing applications. The proposed approach of this article is also useful and applicable for heterogeneous parallel execution of the other improved versions of PSO algorithm, on both CPUs and GPUs. The results demonstrate that the proposed approach provides higher performance, in terms of delay and power consumption, than the existence implementations of ALC-PSO algorithm.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103084"},"PeriodicalIF":1.4,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140327837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-16DOI: 10.1016/j.parco.2024.103083
Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim
Deep learning (DL)-based automatic modulation classification (AMC) is a primary research field for identifying modulation types. However, traditional DL-based AMC approaches rely on hand-crafted features, which can be time-consuming and may not capture all relevant information in the signal. Additionally, they are centralized solutions that are trained on large amounts of data acquired from local clients and stored on a server, leading to weak performance in terms of correct classification probability. To address these issues, a federated learning (FL)-based AMC approach is proposed, called FL-MP-CNN-AMC, which takes into account the effects of multipath channels (reflected and scattered paths) and considers the use of a modified loss function for solving the class imbalance problem caused by these channels. In addition, hyperparameter tuning and optimization of the loss function are discussed and analyzed to improve the performance of the proposed approach. The classification performance is investigated by considering the effects of interference level, delay spread, scattered and reflected paths, phase offset, and frequency offset. The simulation results show that the proposed approach provides excellent performance in terms of correct classification probability, confusion matrix, convergence and communication overhead when compared to contemporary methods.
{"title":"Federated learning based modulation classification for multipath channels","authors":"Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim","doi":"10.1016/j.parco.2024.103083","DOIUrl":"10.1016/j.parco.2024.103083","url":null,"abstract":"<div><p>Deep learning (DL)-based automatic modulation classification (AMC) is a primary research field for identifying modulation types. However, traditional DL-based AMC approaches rely on hand-crafted features, which can be time-consuming and may not capture all relevant information in the signal. Additionally, they are centralized solutions that are trained on large amounts of data acquired from local clients and stored on a server, leading to weak performance in terms of correct classification probability. To address these issues, a federated learning (FL)-based AMC approach is proposed, called FL-MP-CNN-AMC, which takes into account the effects of multipath channels (reflected and scattered paths) and considers the use of a modified loss function for solving the class imbalance problem caused by these channels. In addition, hyperparameter tuning and optimization of the loss function are discussed and analyzed to improve the performance of the proposed approach. The classification performance is investigated by considering the effects of interference level, delay spread, scattered and reflected paths, phase offset, and frequency offset. The simulation results show that the proposed approach provides excellent performance in terms of correct classification probability, confusion matrix, convergence and communication overhead when compared to contemporary methods.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103083"},"PeriodicalIF":1.4,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140171507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-12DOI: 10.1016/j.parco.2024.103082
Kaihao Ma , Zhenkun Cai , Xiao Yan , Yang Zhang , Zhi Liu , Yihui Feng , Chao Li , Wei Lin , James Cheng
Multi-tenant GPU clusters are common, where users purchase GPU quota to run their neural network training jobs. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but results in fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status for making good scheduling decisions. Different from existing schedulers that rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes in that PPS runs a job to completion without adjustment once scheduled and requires only aggregate job statistics. The black-box feature is favorable due to its good generality, compatibility and security, and made possible by the predictability of aggregate resource utilization statistics of large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.
{"title":"PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters","authors":"Kaihao Ma , Zhenkun Cai , Xiao Yan , Yang Zhang , Zhi Liu , Yihui Feng , Chao Li , Wei Lin , James Cheng","doi":"10.1016/j.parco.2024.103082","DOIUrl":"10.1016/j.parco.2024.103082","url":null,"abstract":"<div><p>Multi-tenant GPU clusters are common, where users purchase GPU quota to run their neural network training jobs. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but results in fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status for making good scheduling decisions. Different from existing schedulers that rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes in that PPS runs a job to completion without adjustment once scheduled and requires only aggregate job statistics. The black-box feature is favorable due to its good generality, compatibility and security, and made possible by the predictability of aggregate resource utilization statistics of large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103082"},"PeriodicalIF":1.4,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140275754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest CUDA toolkit will improve application performance; however, considering results from CPU compilers, there are cases where this is not true. In this paper, we thoroughly evaluate the impact of CUDA toolkit version on the performance, power consumption, and energy consumption of GPU applications with four GPU architectures. Our results show that though the latest CUDA toolkit obtains the best performance, power consumption, and energy consumption for many applications in most cases, but we found a few exceptions. For such applications, we conducted an in-depth analysis using the SASS to identify why older CUDA toolkit achieve performance improvement. Our analysis showed that the factors that caused them are by three phenomena: aggressive loop unrolling, inefficient instruction scheduling, and the impact of host compilers.
CUDA 工具包被广泛用于开发在英伟达™(NVIDIA®)GPU 上运行的应用程序。这些工具包包括编译器,并经常更新,以整合最先进的编译技术。因此,许多高性能计算用户认为最新的CUDA工具包将提高应用程序的性能;然而,考虑到CPU编译器的结果,有些情况下并非如此。在本文中,我们全面评估了 CUDA 工具包版本对采用三种 GPU 架构的 GPU 应用程序的性能、功耗和能耗的影响。我们的结果表明,虽然最新的 CUDA 工具包在大多数情况下都能为许多应用获得最佳性能、功耗和能耗,但我们也发现了一些例外情况。针对这些应用,我们使用 SASS 进行了深入分析,以找出旧版 CUDA 工具包性能提升的原因。我们的分析表明,造成这些问题的因素有三个:激进的循环解卷、低效的指令调度以及主机编译器的影响。
{"title":"Analyzing the impact of CUDA versions on GPU applications","authors":"Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda","doi":"10.1016/j.parco.2024.103081","DOIUrl":"10.1016/j.parco.2024.103081","url":null,"abstract":"<div><p>CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest CUDA toolkit will improve application performance; however, considering results from CPU compilers, there are cases where this is not true. In this paper, we thoroughly evaluate the impact of CUDA toolkit version on the performance, power consumption, and energy consumption of GPU applications with four GPU architectures. Our results show that though the latest CUDA toolkit obtains the best performance, power consumption, and energy consumption for many applications in most cases, but we found a few exceptions. For such applications, we conducted an in-depth analysis using the SASS to identify why older CUDA toolkit achieve performance improvement. Our analysis showed that the factors that caused them are by three phenomena: aggressive loop unrolling, inefficient instruction scheduling, and the impact of host compilers.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103081"},"PeriodicalIF":1.4,"publicationDate":"2024-02-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S016781912400019X/pdfft?md5=62bfbd6666b978a441b0d0daa8420592&pid=1-s2.0-S016781912400019X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-28DOI: 10.1016/j.parco.2024.103080
Jianjiang Li , Lin Li , Qingwei Wang , Wei Xue , Jiabi Liang , Jinliang Shi
Large-scale sparse linear equation solver plays an important role in both numerical simulation and artificial intelligence, and sparse triangular equation solver is a key step in solving sparse linear equation systems. Its parallel optimization can effectively improve the efficiency of solving sparse linear equation systems. In this paper, we design and implement a parallel algorithm for solving sparse triangular equations in combination with the features of the new generation of Sunway architecture, and optimize the access and communication respectively for 949 real equations and 32 complex equations in the SuiteSparse collection. The solution efficiency of the algorithm presented in this paper outperforms the cuSparse algorithm on NVIDIA V100 GPU platforms in more than 71% of the cases, and the speedup is even better in solving larger cases (matrix size greater than 10,000): our method increases the speedup from 1.29 time of the previous version to an average speedup of 5.54 and the best speedup of 32.18 over the sequential method on the next generation of Sunway architecture when using 64 slave cores.
{"title":"Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture","authors":"Jianjiang Li , Lin Li , Qingwei Wang , Wei Xue , Jiabi Liang , Jinliang Shi","doi":"10.1016/j.parco.2024.103080","DOIUrl":"https://doi.org/10.1016/j.parco.2024.103080","url":null,"abstract":"<div><p>Large-scale sparse linear equation solver plays an important role in both numerical simulation and artificial intelligence, and sparse triangular equation solver is a key step in solving sparse linear equation systems. Its parallel optimization can effectively improve the efficiency of solving sparse linear equation systems. In this paper, we design and implement a parallel algorithm for solving sparse triangular equations in combination with the features of the new generation of Sunway architecture, and optimize the access and communication respectively for 949 real equations and 32 complex equations in the SuiteSparse collection. The solution efficiency of the algorithm presented in this paper outperforms the cuSparse algorithm on NVIDIA V100 GPU platforms in more than 71% of the cases, and the speedup is even better in solving larger cases (matrix size greater than 10,000): our method increases the speedup from 1.29 time of the previous version to an average speedup of 5.54 and the best speedup of 32.18 over the sequential method on the next generation of Sunway architecture when using 64 slave cores.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"120 ","pages":"Article 103080"},"PeriodicalIF":1.4,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140024112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-02-01DOI: 10.1016/j.parco.2024.103064
Ke Liu , Haonan Tong , Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang
The explosion of data over the last decades puts significant strain on the computational capacity of the central processing unit (CPU), challenging online analytical processing (OLAP). While previous studies have shown the potential of using Field Programmable Gate Arrays (FPGAs) in database systems, integrating FPGA-based hardware acceleration with relational databases remains challenging because of the complex nature of relational database operations and the need for specialized FPGA programming skills. Additionally, there are significant challenges related to optimizing FPGA-based acceleration for specific database workloads, ensuring data consistency and reliability, and integrating FPGA-based hardware acceleration with existing database infrastructure. In this study, we proposed a novel end-to-end FPGA-based acceleration system that supports native SQL statements and storage engine. We defined a callback process to reload the database query logic and customize the scanning method for database queries. Through middleware process development, we optimized offloading efficiency on PCIe bus by scheduling data transmission and computation in a pipeline workflow. Additionally, we designed a novel five-stage FPGA microarchitecture module that achieves optimal clock frequency, further enhancing offloading efficiency. Results from systematic evaluations indicate that our solution allows a single FPGA card to perform as well as 8 CPU query processes, while reducing CPU load by 34%. Compared to using 4 CPU cores, our FPGA-based acceleration system reduces query latency by 1.7 times without increasing CPU load. Furthermore, our proposed solution achieves 2.1 times computation speedup for data filtering compared with the software baseline in a single core environment. Overall, our work presents a valuable end-to-end hardware acceleration system for OLAP databases.
过去几十年来,数据量激增,给中央处理器(CPU)的计算能力带来了巨大压力,给联机分析处理(OLAP)带来了挑战。虽然以前的研究已经显示了在数据库系统中使用现场可编程门阵列(FPGA)的潜力,但由于关系数据库操作的复杂性以及对专业 FPGA 编程技能的需求,将基于 FPGA 的硬件加速与关系数据库集成仍然具有挑战性。此外,在针对特定数据库工作负载优化基于 FPGA 的加速、确保数据一致性和可靠性以及将基于 FPGA 的硬件加速与现有数据库基础架构集成等方面也存在重大挑战。在本研究中,我们提出了一种新颖的端到端基于 FPGA 的加速系统,该系统支持本地 SQL 语句和存储引擎。我们定义了一个回调流程,用于重新加载数据库查询逻辑和定制数据库查询的扫描方法。通过中间件流程开发,我们在流水线工作流程中调度数据传输和计算,优化了 PCIe 总线上的卸载效率。此外,我们还设计了一种新颖的五级 FPGA 微体系结构模块,实现了最佳时钟频率,进一步提高了卸载效率。系统评估结果表明,我们的解决方案使单个 FPGA 卡的性能与 8 个 CPU 查询进程相当,同时将 CPU 负载降低了 34%。与使用 4 个 CPU 内核相比,我们基于 FPGA 的加速系统在不增加 CPU 负载的情况下将查询延迟降低了 1.7 倍。此外,与单核环境下的软件基线相比,我们提出的解决方案在数据过滤方面的计算速度提高了 2.1 倍。总之,我们的工作为 OLAP 数据库提供了一个有价值的端到端硬件加速系统。
{"title":"Integrating FPGA-based hardware acceleration with relational databases","authors":"Ke Liu , Haonan Tong , Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang","doi":"10.1016/j.parco.2024.103064","DOIUrl":"10.1016/j.parco.2024.103064","url":null,"abstract":"<div><p>The explosion of data over the last decades puts significant strain on the computational capacity of the central processing unit (CPU), challenging online analytical processing (OLAP). While previous studies have shown the potential of using Field Programmable Gate Arrays (FPGAs) in database systems, integrating FPGA-based hardware acceleration with relational databases remains challenging because of the complex nature of relational database operations and the need for specialized FPGA programming skills. Additionally, there are significant challenges related to optimizing FPGA-based acceleration for specific database workloads, ensuring data consistency and reliability, and integrating FPGA-based hardware acceleration with existing database infrastructure. In this study, we proposed a novel end-to-end FPGA-based acceleration system that supports native SQL statements and storage engine. We defined a callback process to reload the database query logic and customize the scanning method for database queries. Through middleware process development, we optimized offloading efficiency on PCIe bus by scheduling data transmission and computation in a pipeline workflow. Additionally, we designed a novel five-stage FPGA microarchitecture module that achieves optimal clock frequency, further enhancing offloading efficiency. Results from systematic evaluations indicate that our solution allows a single FPGA card to perform as well as 8 CPU query processes, while reducing CPU load by 34%. Compared to using 4 CPU cores, our FPGA-based acceleration system reduces query latency by 1.7 times without increasing CPU load. Furthermore, our proposed solution achieves 2.1 times computation speedup for data filtering compared with the software baseline in a single core environment. Overall, our work presents a valuable end-to-end hardware acceleration system for OLAP databases.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"119 ","pages":"Article 103064"},"PeriodicalIF":1.4,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819124000024/pdfft?md5=d270aeec859768a5bff3f5d4988863f9&pid=1-s2.0-S0167819124000024-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139825507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}