
Parallel Computing: Latest Articles

Extending the limit of LR-TDDFT on two different approaches: Numerical algorithms and new Sunway heterogeneous supercomputer
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-05-04 | DOI: 10.1016/j.parco.2024.103085
Qingcai Jiang, Zhenwei Cao, Xinhui Cui, Lingyun Wan, Xinming Qin, Huanqi Cao, Hong An, Junshi Chen, Jie Liu, Wei Hu, Jinlong Yang

First-principles time-dependent density functional theory (TDDFT) is a powerful tool for accurately describing the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback of TDDFT calculations is their ultrahigh computational cost of $O(N^5 \sim N^6)$ and large memory usage of $O(N^4)$, especially with a plane-wave basis set, which hinders their application to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and accelerate it in two different ways: (1) numerical algorithms on an X86 supercomputer and (2) optimizations for the heterogeneous architecture of the new Sunway supercomputer. Furthermore, we carefully design the parallel data and task distribution schemes to suit the physical nature of the different computation steps. With these two approaches, our implementation achieves overall speedups of 10x and 80x and scales efficiently to large systems of up to 4096 and 2744 atoms within dozens of seconds.

Parallel Computing, vol. 120, Article 103085.
Citations: 0
An approach for low-power heterogeneous parallel implementation of ALC-PSO algorithm using OmpSs and CUDA
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-03-26 | DOI: 10.1016/j.parco.2024.103084
Fahimeh Yazdanpanah, Mohammad Alaei

PSO (particle swarm optimization) is an intelligent search method that finds the best solution according to the state of a population. Various parallel implementations of this algorithm have been presented for compute-intensive applications. The ALC-PSO algorithm (PSO with an aging leader and challengers) is an improved population-based procedure that converges faster than traditional PSO. In this paper, we propose a low-power heterogeneous parallel implementation of the ALC-PSO algorithm using OmpSs and CUDA, for execution on both CPU and GPU cores. This is the first effort to implement ALC-PSO in a heterogeneous parallel manner by combining OmpSs and CUDA. This hybrid parallel programming approach increases the performance and efficiency of compute-intensive applications. The proposed approach is also applicable to the heterogeneous parallel execution of other improved versions of the PSO algorithm on both CPUs and GPUs. The results demonstrate that the proposed approach performs better, in terms of delay and power consumption, than existing implementations of the ALC-PSO algorithm.
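
For readers unfamiliar with the baseline, the following is a minimal sequential sketch of the traditional global-best PSO that the abstract compares against. It is not ALC-PSO (there is no aging leader or challenger mechanism) and it uses neither OmpSs nor CUDA; the function names `pso` and `sphere` and the parameter values (`w`, `c1`, `c2`, swarm size) are illustrative assumptions, not taken from the paper.

```python
import random


def pso(objective, dim, bounds, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` with a basic global-best PSO (illustrative only)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test function with its minimum at the origin."""
    return sum(v * v for v in x)
```

Calling `pso(sphere, dim=3, bounds=(-5.0, 5.0))` returns the best position found and its objective value. The per-particle evaluation and update loop is the part that parallel implementations typically distribute across CPU and GPU cores.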

Parallel Computing, vol. 120, Article 103084.
Citations: 0
Federated learning based modulation classification for multipath channels
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-03-16 | DOI: 10.1016/j.parco.2024.103083
Sanjay Bhardwaj, Da-Hye Kim, Dong-Seong Kim

Deep learning (DL)-based automatic modulation classification (AMC) is a primary research field for identifying modulation types. However, traditional DL-based AMC approaches rely on hand-crafted features, which can be time-consuming and may not capture all relevant information in the signal. Additionally, they are centralized solutions that are trained on large amounts of data acquired from local clients and stored on a server, leading to weak performance in terms of correct classification probability. To address these issues, a federated learning (FL)-based AMC approach is proposed, called FL-MP-CNN-AMC, which takes into account the effects of multipath channels (reflected and scattered paths) and considers the use of a modified loss function for solving the class imbalance problem caused by these channels. In addition, hyperparameter tuning and optimization of the loss function are discussed and analyzed to improve the performance of the proposed approach. The classification performance is investigated by considering the effects of interference level, delay spread, scattered and reflected paths, phase offset, and frequency offset. The simulation results show that the proposed approach provides excellent performance in terms of correct classification probability, confusion matrix, convergence and communication overhead when compared to contemporary methods.
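
As a rough illustration of the two ingredients named in the abstract, the sketch below shows a generic federated averaging step and a class-weighted cross-entropy loss. Both are assumptions chosen for illustration: the abstract does not say that FL-MP-CNN-AMC uses FedAvg-style aggregation or inverse-frequency weighting, and the names `fed_avg`, `class_weights`, and `weighted_cross_entropy` are hypothetical.

```python
import math


def fed_avg(client_weights, client_sizes):
    """Average flat parameter vectors, weighting each client by its sample count.

    Only parameters and sample counts reach the server; raw signals stay local.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(n_params)
    ]


def class_weights(label_counts):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    total = sum(label_counts)
    k = len(label_counts)
    return [total / (k * c) for c in label_counts]


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])
```

For example, `fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])` gives the second client three times the influence of the first, which is the usual way to account for unequal local dataset sizes.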

Parallel Computing, vol. 120, Article 103083.
Citations: 0
PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-03-12 | DOI: 10.1016/j.parco.2024.103082
Kaihao Ma, Zhenkun Cai, Xiao Yan, Yang Zhang, Zhi Liu, Yihui Feng, Chao Li, Wei Lin, James Cheng

Multi-tenant GPU clusters are common, where users purchase GPU quota to run their neural network training jobs. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but results in fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status and make good scheduling decisions. Unlike existing schedulers, which rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes: once a job is scheduled, PPS runs it to completion without adjustment, and it requires only aggregate job statistics. The black-box feature is favorable because of its good generality, compatibility and security, and it is made possible by the predictability of aggregate resource utilization statistics in large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously.

Parallel Computing, vol. 120, Article 103082.
Citations: 0
Analyzing the impact of CUDA versions on GPU applications
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-02-29 | DOI: 10.1016/j.parco.2024.103081
Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Hiroki Honda

CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest CUDA toolkit will improve application performance; however, considering results from CPU compilers, there are cases where this is not true. In this paper, we thoroughly evaluate the impact of the CUDA toolkit version on the performance, power consumption, and energy consumption of GPU applications on four GPU architectures. Our results show that the latest CUDA toolkit obtains the best performance, power consumption, and energy consumption for many applications in most cases, but we found a few exceptions. For those applications, we conducted an in-depth analysis using the SASS to identify why older CUDA toolkits achieve better performance. Our analysis showed that three phenomena are responsible: aggressive loop unrolling, inefficient instruction scheduling, and the impact of host compilers.

Parallel Computing, vol. 120, Article 103081. Open access PDF: https://www.sciencedirect.com/science/article/pii/S016781912400019X/pdfft?md5=62bfbd6666b978a441b0d0daa8420592&pid=1-s2.0-S016781912400019X-main.pdf
Citations: 0
Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-02-28 | DOI: 10.1016/j.parco.2024.103080
Jianjiang Li, Lin Li, Qingwei Wang, Wei Xue, Jiabi Liang, Jinliang Shi

Large-scale sparse linear equation solvers play an important role in both numerical simulation and artificial intelligence, and the sparse triangular solve is a key step in solving sparse linear equation systems. Optimizing it for parallel execution can effectively improve the efficiency of solving such systems. In this paper, we design and implement a parallel algorithm for solving sparse triangular equations that exploits the features of the new generation of Sunway architecture, and we optimize access and communication for 949 real equations and 32 complex equations from the SuiteSparse collection. The solution efficiency of the algorithm presented in this paper outperforms the cuSparse algorithm on NVIDIA V100 GPU platforms in more than 71% of the cases. The speedup is even better for larger cases (matrix size greater than 10,000): using 64 slave cores on the new generation of Sunway architecture, our method raises the speedup over the sequential method from 1.29 times in the previous version to 5.54 on average, with a best speedup of 32.18.
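
To make the parallelization problem concrete, here is a small sketch of a sparse lower-triangular solve using level scheduling, a common generic technique. It is an assumption for illustration and not necessarily the scheme used in the paper. The matrix is assumed to be in CSR form (`indptr`, `indices`, `data`) with a nonzero diagonal; `level_sets` and `solve_lower` are hypothetical names.

```python
def level_sets(indptr, indices):
    """Group the rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on every row j < i with a nonzero L[i][j], so its level is
    one more than the deepest level among those rows.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    groups = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        groups[lv].append(i)
    return groups


def solve_lower(indptr, indices, data, b):
    """Solve L x = b by forward substitution, processed level by level."""
    x = [0.0] * len(b)
    for rows in level_sets(indptr, indices):
        # Rows in the same level do not depend on each other, so this inner
        # loop is the part a parallel solver could spread across cores.
        for i in rows:
            s = b[i]
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    s -= data[k] * x[j]
            if diag is None:
                raise ValueError(f"row {i} has no diagonal entry")
            x[i] = s / diag
    return x
```

The number of levels bounds the available parallelism: a matrix whose rows form one long dependency chain yields one row per level and cannot be sped up this way, which is one reason results vary so much across the SuiteSparse matrices.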

Parallel Computing, vol. 120, Article 103080.
Citations: 0
Editorial for parallel computing
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-02-01 | DOI: 10.1016/j.parco.2024.103065
Anne Benoit
Parallel Computing, vol. 119, Article 103065. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0167819124000036/pdfft?md5=6609917dd084289a0bd17a84b82785e4&pid=1-s2.0-S0167819124000036-main.pdf
Citations: 0
Integrating FPGA-based hardware acceleration with relational databases
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-02-01 | DOI: 10.1016/j.parco.2024.103064
Ke Liu, Haonan Tong, Zhongxiang Sun, Zhixin Ren, Guangkui Huang, Hongyin Zhu, Luyang Liu, Qunyang Lin, Chuang Zhang

The explosion of data over the last decades puts significant strain on the computational capacity of the central processing unit (CPU), challenging online analytical processing (OLAP). While previous studies have shown the potential of using Field Programmable Gate Arrays (FPGAs) in database systems, integrating FPGA-based hardware acceleration with relational databases remains challenging because of the complex nature of relational database operations and the need for specialized FPGA programming skills. Additionally, there are significant challenges related to optimizing FPGA-based acceleration for specific database workloads, ensuring data consistency and reliability, and integrating FPGA-based hardware acceleration with existing database infrastructure. In this study, we propose a novel end-to-end FPGA-based acceleration system that supports native SQL statements and the storage engine. We define a callback process to reload the database query logic and customize the scanning method for database queries. Through middleware process development, we optimize offloading efficiency on the PCIe bus by scheduling data transmission and computation in a pipelined workflow. Additionally, we design a novel five-stage FPGA microarchitecture module that achieves optimal clock frequency, further enhancing offloading efficiency. Results from systematic evaluations indicate that our solution allows a single FPGA card to perform as well as 8 CPU query processes, while reducing CPU load by 34%. Compared to using 4 CPU cores, our FPGA-based acceleration system reduces query latency by a factor of 1.7 without increasing CPU load. Furthermore, our proposed solution achieves a 2.1 times computation speedup for data filtering compared with the software baseline in a single-core environment. Overall, our work presents a valuable end-to-end hardware acceleration system for OLAP databases.

Parallel Computing, vol. 119, Article 103064. Open access PDF: https://www.sciencedirect.com/science/article/pii/S0167819124000024/pdfft?md5=d270aeec859768a5bff3f5d4988863f9&pid=1-s2.0-S0167819124000024-main.pdf
Citations: 0
Fast data-dependence profiling through prior static analysis
IF 1.4 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-01-11 | DOI: 10.1016/j.parco.2024.103063
Mohammad Norouzi, Nicolas Morew, Qamar Ilias, Lukas Rothenberger, Ali Jannesari, Felix Wolf

Data-dependence profiling is a program-analysis technique for detecting parallelism opportunities in sequential programs. It captures data dependences that actually occur during program execution, filtering out parallelism-preventing dependences that purely static methods assume only because they lack critical runtime information, such as the values of pointers and array indices. Profiling, however, suffers from high runtime overhead. In our earlier work, we accelerated data-dependence profiling by excluding polyhedral loops that can be handled statically using certain compilers and by eliminating scalar variables that create statically identifiable data dependences. In this paper, we combine the two methods and integrate them into DiscoPoP, a data-dependence profiler and parallelism discovery tool. Additionally, we detect reduction patterns statically and unify the three static analyses within the DiscoPoP framework to significantly reduce the profiling overhead for a wider range of programs. We have evaluated our unified approach with 49 benchmarks from three benchmark suites and two computer simulation applications. The evaluation results show that our approach reports fewer false positive and false negative data dependences than the original data-dependence profiler and reduces the profiling time by at least 43%, with a median reduction of 76% across all programs. Also, we identify 40% of reduction cases statically and eliminate the associated profiling overhead for these cases.
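
The idea of recording dependences that actually occur at run time can be sketched with a simple shadow-memory profiler over a memory-access trace. This is a generic illustration, not DiscoPoP's implementation; the trace format `(line, op, addr)` and the function name `profile_dependences` are assumptions made for the example.

```python
def profile_dependences(trace):
    """Collect RAW, WAR, and WAW dependences from a memory-access trace.

    `trace` is an iterable of (line, op, addr) tuples, where `op` is "R" or "W".
    Returns a set of (kind, sink_line, source_line) tuples.
    """
    last_write = {}          # addr -> line of the most recent write
    reads_since_write = {}   # addr -> lines that read since that write
    deps = set()
    for line, op, addr in trace:
        if op == "R":
            if addr in last_write:
                # Read-after-write: this read consumes an earlier write.
                deps.add(("RAW", line, last_write[addr]))
            reads_since_write.setdefault(addr, []).append(line)
        elif op == "W":
            if addr in last_write:
                # Write-after-write: this write overwrites an earlier one.
                deps.add(("WAW", line, last_write[addr]))
            for reader in reads_since_write.pop(addr, []):
                # Write-after-read: this write must not overtake earlier reads.
                deps.add(("WAR", line, reader))
            last_write[addr] = line
        else:
            raise ValueError(f"unknown access type: {op!r}")
    return deps
```

Every instrumented load and store passes through a loop like this, which is where the high runtime overhead mentioned above comes from. Any access whose dependences can be determined statically can be left uninstrumented and skipped here.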

Parallel Computing, vol. 119, Article 103063, published 2024-01-11. PDF: https://www.sciencedirect.com/science/article/pii/S0167819124000012/pdfft?md5=99e5ae1bcda1fac5d3c65fb23d0ba7f8&pid=1-s2.0-S0167819124000012-main.pdf
Citations: 0
A GPU-based hydrodynamic simulator with boid interactions
IF 1.4 | Zone 4 (Computer Science) | Q2, COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2023-12-21 | DOI: 10.1016/j.parco.2023.103062
Xi Liu, Gizem Kayar, Ken Perlin

We present a hydrodynamic simulation system using the GPU compute shaders of DirectX for simulating virtual agent behaviors and navigation inside a smoothed particle hydrodynamical (SPH) fluid environment with real-time water mesh surface reconstruction. The current SPH literature includes interactions between SPH and heterogeneous meshes but seldom involves interactions between SPH and virtual boid agents. The contribution of the system lies in the combination of the parallel smoothed particle hydrodynamics model with the distributed boid model of virtual agents to enable agents to interact with fluids. The agents based on the boid algorithm influence the motion of SPH fluid particles, and the forces from the SPH algorithm affect the movement of the boids. To enable realistic fluid rendering and simulation in a particle-based system, it is essential to construct a mesh from the particle attributes. Our system also contributes to the surface reconstruction aspect of the pipeline, in which we performed a set of experiments with the parallel marching cubes algorithm per frame for constructing the mesh from the fluid particles in a real-time compute and memory-intensive application, producing a wide range of triangle configurations. We also demonstrate that our system is versatile enough for reinforced robotic agents instead of boid agents to interact with the fluid environment for underwater navigation and remote control engineering purposes.
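
The two-way influence between boids and fluid particles can be illustrated with a small CPU sketch in Python. The abstract does not specify the force model, so everything here is an assumption made for illustration: the function names `poly6` and `coupling_forces`, the drag coefficient `k`, the smoothing radius `h`, and the choice of a kernel-weighted drag term. The actual system runs on DirectX compute shaders and may couple the two models differently.

```python
import math


def poly6(r, h):
    """Poly6 smoothing kernel with 2D normalization; zero outside radius h."""
    if r >= h:
        return 0.0
    return 4.0 / (math.pi * h ** 8) * (h * h - r * r) ** 3


def coupling_forces(fluid_pos, fluid_vel, boid_pos, boid_vel, h=1.0, k=0.5):
    """Return (force_on_fluid, force_on_boids) for an assumed drag coupling.

    Each boid is pulled toward the velocity of nearby fluid particles, and
    each particle receives the equal and opposite reaction.
    """
    f_fluid = [[0.0, 0.0] for _ in fluid_pos]
    f_boid = [[0.0, 0.0] for _ in boid_pos]
    for b, (bp, bv) in enumerate(zip(boid_pos, boid_vel)):
        for p, (fp, fv) in enumerate(zip(fluid_pos, fluid_vel)):
            w = poly6(math.dist(bp, fp), h)      # weight decays with distance
            for d in range(2):
                drag = k * w * (fv[d] - bv[d])   # fluid drags the boid along
                f_boid[b][d] += drag
                f_fluid[p][d] -= drag            # reaction on the fluid particle
    return f_fluid, f_boid
```

Because every contribution is applied with opposite signs to the boid and the particle, the coupling adds no net momentum to the combined system. On a GPU, the inner loop would typically be restricted to neighboring grid cells rather than all particles.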

Parallel Computing, vol. 119, Article 103062, published 2023-12-21. PDF: https://www.sciencedirect.com/science/article/pii/S0167819123000686/pdfft?md5=c561b22916df38cc210c4a6988c337bc&pid=1-s2.0-S0167819123000686-main.pdf
Citations: 0