Parallel Computing: Latest Articles (Page 2)

Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation

IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-07-02 DOI: 10.1016/j.parco.2024.103093

Alexander Agathos , Philip Azariadis

The k-Nearest Neighbors (KNN) algorithm is a fundamental algorithm with applications in many fields, such as Machine Learning, Computer Graphics, and Computer Vision. For each point of a query set Q, the algorithm determines the k closest (d-dimensional) points of a reference set R under a specific metric (Euclidean, Mahalanobis, Manhattan, etc.). This work focuses on using multiple Graphics Processing Units (GPUs) to accelerate the KNN algorithm on large or very large sets of 3D points. In the proposed approach, the space of the reference set is divided into a 3D grid that facilitates the search for the nearest neighbors. The search in the grid is performed in a multiresolution manner, starting from a high-resolution grid and ending at a coarse one, thus accounting for point clouds that may have non-uniform sampling and/or outliers. Three important algorithms in reverse engineering are revisited, and new multi-GPU versions are proposed based on the introduced KNN algorithm. More specifically, the new multi-GPU approach is applied to the Iterative Closest Point (ICP) algorithm, to point cloud smoothing, and to the point cloud normal vector computation and orientation problem. A series of tests and experiments is reported and discussed in the paper, showing the merits of the proposed multi-GPU approach.
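The grid-based, fine-to-coarse search described in this abstract can be illustrated with a minimal single-threaded sketch. This is not the authors' multi-GPU implementation: the function names (build_grid, knn_grid), the cell-doubling rule, and the 3x3x3 block search are assumptions chosen only to show the idea, with the Euclidean metric assumed.

```python
import math
from collections import defaultdict


def build_grid(points, cell):
    """Hash each 3D point index into the cubic cell of edge `cell` containing it."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / cell) for c in p)].append(i)
    return grid


def knn_grid(reference, query, k, cell):
    """Return, for each query point, the indices of its k nearest reference points.

    Hypothetical helper: starts on a fine grid and doubles the cell size
    (a coarser grid) whenever the fine grid cannot guarantee the answer.
    """
    assert 0 < k <= len(reference)
    grids = {}  # one cached grid per resolution
    results = []
    for q in query:
        h = cell
        while True:
            if h not in grids:
                grids[h] = build_grid(reference, h)
            grid = grids[h]
            cx, cy, cz = (math.floor(c / h) for c in q)
            # Candidates: all points in the 3x3x3 block of cells around q.
            cand = [i
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
                    for i in grid.get((cx + dx, cy + dy, cz + dz), ())]
            cand.sort(key=lambda i: math.dist(reference[i], q))
            # Every point within distance h of q lies in the block, so the
            # result is exact once the k-th candidate is no farther than h.
            if len(cand) >= k and math.dist(reference[cand[k - 1]], q) <= h:
                break
            if len(cand) == len(reference):
                break  # the block already covers the whole reference set
            h *= 2  # fall back to a coarser grid
        results.append(cand[:k])
    return results
```

Dense regions are resolved on the fine grid, while sparse regions and outliers fall through to coarser grids. In a GPU setting each query point would typically be handled by its own thread; the sketch keeps the loop sequential for clarity.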

Volume 121, Article 103093

Citations: 0

WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel

IF 2; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-06-29 DOI: 10.1016/j.parco.2024.103092

Duo Yang , Bing Hu , An Liu , A-Long Jin , Kwan L. Yeung , Yang You

The parameter server architecture is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during the synchronization process and decouples the gradient upload and model download of fast workers into asymmetric parts. This allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is updated only when the slowest worker uploads its gradients, which ensures both the consistency of the global models pulled by all workers and the convergence of the global model. Building upon WBSP, we propose an optimized version that further reduces the communication overhead. It enables communication and computation tasks to run in parallel on workers, shortening the global synchronization interval and thereby improving training speed. We conduct theoretical analyses of the proposed mechanisms. Extensive experiments verify that our mechanism reduces the time required to reach the target accuracy by up to 60% compared with the fastest method and increases the proportion of computation time from 55%–72% in existing methods to 91%.
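The utilization argument in this abstract can be made concrete with a toy timing model. The sketch below is an assumption-laden illustration, not the WBSP protocol itself: it ignores communication time, assumes fixed integer step times per worker, and the function name round_utilization is hypothetical. It compares one synchronization round in which every worker runs a single step and then waits against one in which fast workers keep training until the slowest worker finishes.

```python
from fractions import Fraction


def round_utilization(step_times):
    """Compare compute utilization of one synchronization round under two policies.

    step_times: time each worker needs for one local training step (positive ints).
    The round lasts as long as the slowest worker's single step.
    Returns (synchronous utilization, worker-busy utilization, steps per worker).
    """
    round_len = max(step_times)
    n = len(step_times)
    # Plain synchronous parallel: one step per worker, then wait for the slowest.
    sync_busy = sum(step_times)
    # Worker-busy: each worker keeps training and fits as many whole steps as it can.
    steps = [round_len // t for t in step_times]
    busy = sum(s * t for s, t in zip(steps, step_times))
    total = n * round_len
    return Fraction(sync_busy, total), Fraction(busy, total), steps
```

For step times of 1, 2, and 4 time units, the synchronous round keeps the workers busy 7/12 of the time, whereas the worker-busy round lets them complete 4, 2, and 1 local steps with no idle time in this simplified model.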

Volume 121, Article 103092

Citations: 0

Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer

IF 1.4; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-05-04 DOI: 10.1016/j.parco.2024.103085

Qingcai Jiang , Zhenwei Cao , Xinhui Cui , Lingyun Wan , Xinming Qin , Huanqi Cao , Hong An , Junshi Chen , Jie Liu , Wei Hu , Jinlong Yang

First-principles time-dependent density functional theory (TDDFT) is a powerful tool for accurately describing the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback of TDDFT calculations is their ultrahigh computational cost, $O(N^{5} \sim N^{6})$, and large memory usage, $O(N^{4})$, especially with a plane-wave basis set, which hinders their application to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate LR-TDDFT in two different ways: (1) numerical algorithms on the X86 supercomputer and (2) optimizations for the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of the different computation steps. With these two methods, our implementation achieves overall speedups of 10x and 80x, respectively, and scales efficiently to large systems of up to 4096 and 2744 atoms within dozens of seconds.

Volume 120, Article 103085

Citations: 0

An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA

IF 1.4; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-03-26 DOI: 10.1016/j.parco.2024.103084

Fahimeh Yazdanpanah, Mohammad Alaei

PSO (particle swarm optimization) is an intelligent search method that finds the best solution according to the state of a population. Various parallel implementations of this algorithm have been presented for compute-intensive applications. The ALC-PSO algorithm (PSO with an aging leader and challengers) is an improved population-based procedure that converges faster than traditional PSO. In this paper, we propose a low-power heterogeneous parallel implementation of the ALC-PSO algorithm using OmpSs and CUDA, for execution on both CPU and GPU cores. This is the first effort to implement the ALC-PSO algorithm in a heterogeneous parallel manner by combining OmpSs and CUDA. This hybrid parallel programming approach increases the performance and efficiency of compute-intensive applications. The proposed approach is also applicable to the heterogeneous parallel execution of other improved versions of the PSO algorithm on both CPUs and GPUs. The results demonstrate that the proposed approach performs better, in terms of delay and power consumption, than the existing implementations of the ALC-PSO algorithm.
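For readers unfamiliar with the baseline, the following is a minimal sketch of plain global-best PSO. It is not ALC-PSO (there is no aging leader or challenger mechanism) and it is sequential Python rather than OmpSs or CUDA. The parameter values (inertia w, acceleration coefficients c1 and c2, the search bounds, and the seed) are illustrative assumptions.

```python
import random


def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, seed=0):
    """Minimize f over [lo, hi]^dim with plain global-best PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

The fitness evaluations inside the particle loop dominate the cost for expensive objective functions, which is why that loop is the usual target for parallelization on CPU and GPU cores.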

Volume 120, Article 103084

Citations: 0

Federated learning based modulation classification for multipath channels

IF 1.4; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-03-16 DOI: 10.1016/j.parco.2024.103083

Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim

Deep learning (DL)-based automatic modulation classification (AMC) is a primary research field for identifying modulation types. However, traditional DL-based AMC approaches rely on hand-crafted features, which can be time-consuming and may not capture all relevant information in the signal. Additionally, they are centralized solutions that are trained on large amounts of data acquired from local clients and stored on a server, leading to weak performance in terms of correct classification probability. To address these issues, a federated learning (FL)-based AMC approach is proposed, called FL-MP-CNN-AMC, which takes into account the effects of multipath channels (reflected and scattered paths) and considers the use of a modified loss function for solving the class imbalance problem caused by these channels. In addition, hyperparameter tuning and optimization of the loss function are discussed and analyzed to improve the performance of the proposed approach. The classification performance is investigated by considering the effects of interference level, delay spread, scattered and reflected paths, phase offset, and frequency offset. The simulation results show that the proposed approach provides excellent performance in terms of correct classification probability, confusion matrix, convergence and communication overhead when compared to contemporary methods.

Volume 120, Article 103083

Citations: 0

PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters

IF 1.4; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-03-12 DOI: 10.1016/j.parco.2024.103082

Kaihao Ma , Zhenkun Cai , Xiao Yan , Yang Zhang , Zhi Liu , Yihui Feng , Chao Li , Wei Lin , James Cheng

Multi-tenant GPU clusters, in which users purchase GPU quota to run their neural network training jobs, are common. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but causes fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status and make good scheduling decisions. Unlike existing schedulers that rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes: once a job is scheduled, PPS runs it to completion without adjustment, and it requires only aggregate job statistics. The black-box feature is favorable because of its generality, compatibility, and security, and it is made possible by the predictability of aggregate resource utilization statistics in large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.

Volume 120, Article 103082

Citations: 0

Analyzing the impact of CUDA versions on GPU applications

IF 1.4; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-02-29 DOI: 10.1016/j.parco.2024.103081

Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda

CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest CUDA toolkit will improve application performance; however, results from CPU compilers suggest that this is not always true. In this paper, we thoroughly evaluate the impact of the CUDA toolkit version on the performance, power consumption, and energy consumption of GPU applications on four GPU architectures. Our results show that the latest CUDA toolkit obtains the best performance, power consumption, and energy consumption for many applications in most cases, although we found a few exceptions. For these applications, we conducted an in-depth analysis of the SASS code to identify why older CUDA toolkits achieve better performance. Our analysis showed that these cases are caused by three phenomena: aggressive loop unrolling, inefficient instruction scheduling, and the impact of host compilers.

Volume 120, Article 103081
Open-access PDF: https://www.sciencedirect.com/science/article/pii/S016781912400019X/pdfft?md5=62bfbd6666b978a441b0d0daa8420592&pid=1-s2.0-S016781912400019X-main.pdf

Citations: 0

Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture

IF 1.4; Zone 4 (Computer Science); Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-02-28 DOI: 10.1016/j.parco.2024.103080

Jianjiang Li , Lin Li , Qingwei Wang , Wei Xue , Jiabi Liang , Jinliang Shi

Large-scale sparse linear equation solvers play an important role in both numerical simulation and artificial intelligence, and the sparse triangular solve is a key step in solving sparse linear systems. Its parallel optimization can effectively improve the efficiency of solving such systems. In this paper, we design and implement a parallel algorithm for solving sparse triangular systems that exploits the features of the new generation of Sunway architecture, and we optimize access and communication, respectively, for 949 real and 32 complex systems from the SuiteSparse collection. The algorithm presented in this paper outperforms the cuSPARSE algorithm on NVIDIA V100 GPU platforms in more than 71% of the cases, and the speedup is even better for larger cases (matrix size greater than 10,000): with 64 slave cores on the next generation of Sunway architecture, our method raises the speedup over the sequential method from 1.29 times in the previous version to an average of 5.54 times and a best case of 32.18 times.
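To show where the parallelism in a sparse triangular solve comes from, here is a minimal sketch of level scheduling for a lower-triangular system in CSR format. Level scheduling is a standard technique and is used here only as an illustration; the abstract does not state which scheme the authors use, and the function names level_sets and sptrsv_lower are hypothetical. The sketch assumes every row stores a nonzero diagonal entry and no entries above the diagonal.

```python
def level_sets(indptr, indices):
    """Group the rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on every row j < i for which entry (i, j) is stored.
    Rows in the same level do not depend on each other.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        deps = [j for j in indices[indptr[i]:indptr[i + 1]] if j < i]
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        sets[lvl].append(i)
    return sets


def sptrsv_lower(indptr, indices, data, b):
    """Solve L x = b by forward substitution, one dependency level at a time."""
    x = [0.0] * len(b)
    for rows in level_sets(indptr, indices):
        # Rows within one level are independent and could be solved in parallel.
        for i in rows:
            s, diag = b[i], None
            for p in range(indptr[i], indptr[i + 1]):
                j = indices[p]
                if j == i:
                    diag = data[p]
                else:
                    s -= data[p] * x[j]  # x[j] was computed in an earlier level
            x[i] = s / diag
    return x
```

For the 3x3 matrix with rows (2, 0, 0), (0, 4, 0), and (1, 1, 1), the first two rows form one level and the third row forms the next, so two of the three unknowns can be computed concurrently. The number of levels and the rows per level bound the available parallelism, which is why unstructured matrices are hard to solve efficiently.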

Volume 120, Article 103080

Citations: 0

Editorial for parallel computing

IF 1.4 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-02-01 DOI: 10.1016/j.parco.2024.103065

Anne Benoit

Citations: 0

Integrating FPGA-based hardware acceleration with relational databases

IF 1.4 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2024-02-01 DOI: 10.1016/j.parco.2024.103064

Ke Liu, Haonan Tong, Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang

The explosion of data over the last decades puts significant strain on the computational capacity of the central processing unit (CPU), challenging online analytical processing (OLAP). While previous studies have shown the potential of using Field Programmable Gate Arrays (FPGAs) in database systems, integrating FPGA-based hardware acceleration with relational databases remains challenging because of the complex nature of relational database operations and the need for specialized FPGA programming skills. Additionally, there are significant challenges related to optimizing FPGA-based acceleration for specific database workloads, ensuring data consistency and reliability, and integrating FPGA-based hardware acceleration with existing database infrastructure. In this study, we propose a novel end-to-end FPGA-based acceleration system that supports native SQL statements and the storage engine. We define a callback process to reload the database query logic and customize the scanning method for database queries. Through middleware process development, we optimize offloading efficiency over the PCIe bus by scheduling data transmission and computation in a pipeline workflow. Additionally, we design a novel five-stage FPGA microarchitecture module that achieves optimal clock frequency, further enhancing offloading efficiency. Results from systematic evaluations indicate that our solution allows a single FPGA card to perform as well as 8 CPU query processes, while reducing CPU load by 34%. Compared to using 4 CPU cores, our FPGA-based acceleration system reduces query latency by a factor of 1.7 without increasing CPU load. Furthermore, our proposed solution achieves a 2.1 times computation speedup for data filtering compared with the software baseline in a single-core environment. Overall, our work presents a valuable end-to-end hardware acceleration system for OLAP databases.
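
The idea of scheduling data transmission and computation as a pipeline can be sketched in software. The example below is only an analogy under stated assumptions: plain Python threads and a bounded queue stand in for the PCIe transfer and the FPGA filter stage, and the name `pipelined_filter` and its parameters `chunk_size` and `depth` are hypothetical, not taken from the paper.

```python
import queue
import threading


def pipelined_filter(rows, predicate, chunk_size=4, depth=2):
    """Filter rows chunk by chunk while the next chunk is being staged.

    `depth` bounds how many chunks may be in flight (depth=2 behaves
    like double buffering).
    """
    staged = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def transfer():
        # Stand-in for the host-to-device copy: hand over one chunk at a time.
        for start in range(0, len(rows), chunk_size):
            staged.put(list(rows[start:start + chunk_size]))
        staged.put(done)

    producer = threading.Thread(target=transfer, daemon=True)
    producer.start()

    kept = []
    while True:
        chunk = staged.get()
        if chunk is done:
            break
        # Stand-in for the accelerator's filter stage. While this runs,
        # the producer thread can already stage the following chunk.
        kept.extend(r for r in chunk if predicate(r))
    producer.join()
    return kept


# Hypothetical query: keep the rows whose value is a multiple of 3.
print(pipelined_filter(list(range(10)), lambda r: r % 3 == 0))  # [0, 3, 6, 9]
```

The bounded queue is the relevant design choice: it lets transfer run ahead of computation by at most `depth` chunks, so neither stage waits for the whole table to be moved before work begins.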

Citations: 0