Parallel Computing: latest articles, page 9

Optimizing small channel 3D convolution on GPU with tensor core

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102954

Jiazhi Jiang, Dan Huang, Jiangsu Du, Yutong Lu, Xiangke Liao

In many scenarios, particularly scientific AI applications, algorithm engineers widely adopt more complex convolutions, e.g. 3D CNNs, to improve accuracy. Scientific AI applications with 3D CNNs, which tend to train on volumetric datasets, substantially increase the size of the input, which in turn potentially restricts the channel sizes (e.g. fewer than 64) under the constraints of limited device memory capacity. Since existing convolution implementations tend to split and parallelize the computation of small-channel convolutions along the channel dimension, they usually cannot fully exploit the performance of GPU accelerators, in particular those equipped with the emerging tensor cores.

In this work, we aim to enhance the performance of small-channel 3D convolution on GPU platforms equipped with tensor cores. Our analysis shows that the channel size of a convolution has a great effect on the performance of existing convolution implementations, which are memory-bound on tensor cores. By leveraging the memory hierarchy characteristics and the WMMA API of tensor cores, we propose and implement holistic optimizations that both improve data access efficiency and increase the utilization of the computing units. Experiments show that our implementation obtains a 1.1x–5.4x speedup compared to cuDNN's implementations of 3D convolutions on different GPU platforms. We also evaluate our implementation on two practical scientific AI applications and observe overall speedups of up to 1.7x and 2.0x compared with using cuDNN on a V100 GPU.
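To make the small-channel problem in the abstract above concrete, here is a minimal Python sketch of the matrix shape that an im2col-style lowering of a 3D convolution would produce. It is not the paper's implementation: the function names, the stride-1 and no-padding assumptions, and the 16-wide tile (chosen to match the 16×16×16 WMMA fragment shape) are all assumptions made for illustration.

```python
# Illustrative sketch only, not the authors' code.
# Assumes stride 1 and no padding; all names are hypothetical.

def conv3d_gemm_shape(batch, c_in, c_out, in_dhw, kernel_dhw):
    """Return (M, N, K) of the matrix product an im2col lowering would perform."""
    out_dhw = [i - k + 1 for i, k in zip(in_dhw, kernel_dhw)]
    m = batch * out_dhw[0] * out_dhw[1] * out_dhw[2]  # one row per output voxel
    n = c_out                                         # one column per output channel
    k = c_in * kernel_dhw[0] * kernel_dhw[1] * kernel_dhw[2]
    return m, n, k


def tile_utilization(n, tile=16):
    """Fraction of a tile-wide column block occupied by real output channels."""
    padded = -(-n // tile) * tile  # round n up to the next multiple of tile
    return n / padded


m, n, k = conv3d_gemm_shape(1, 4, 8, (34, 34, 34), (3, 3, 3))
print(m, n, k)              # 32768 8 108
print(tile_utilization(n))  # 0.5
```

With only 8 output channels, the channel dimension fills half of an assumed 16-wide tile while the voxel dimension is very large, which is one way to see why splitting work along channels leaves matrix units underused.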

Volume 113, Article 102954.

Citations: 4

Graph optimization algorithm using symmetry and host bias for low-latency indirect network

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.2139/ssrn.4048955

M. Nakao, M. Tsukamoto, Y. Hanada, Keiji Yamamoto

Citations: 1

A method for efficient radio astronomical data gridding on multi-core vector processor

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102972

Hao Wang, Ce Yu, Jian Xiao, Shanjiang Tang, Yu Lu, Hao Fu, Bo Kang, Gang Zheng, Chenzhou Cui

Gridding is the performance-critical step in the data reduction pipeline for radio astronomy research, allowing astronomers to create correct sky images for further analysis. Like 2D stencil computation, gridding iteratively updates the output cells by convolution, where the value at each output cell is computed as a weighted sum of neighboring point values. Existing state-of-the-art works have improved gridding performance by using multi-core CPUs and GPUs in real-world applications, and they showed that gridding is a type of scientific computation with high-density computing characteristics. However, low computational performance or high power consumption remains their main limitation when processing large-scale astronomical data. The high-density computing characteristics of gridding provide opportunities to accelerate it on multi-core vector processors with vector-SIMD architectures. However, the task parallelization and data transfer strategies of existing works (such as those implemented on CPUs or GPUs) are inefficient when used to perform gridding directly on a vector processor without a dedicated mapping algorithm.

M-DSP is a multi-core vector processor with vector-SIMD architectures designed for the next-generation exascale supercomputer, delivering high performance with ultra-low power consumption. In this paper, we present, for the first time, a novel method to achieve efficient gridding on the M-DSP. Specifically, we propose a gridding workflow designed for vector-SIMD architectures and present a vectorized version of the gridding convolution algorithm to fully exploit the computational power of the M-DSP. In addition, centering on the processor architecture, we propose task-based parallelization strategies for block and line computing, as well as different data loading strategies, to achieve high parallel performance and high data transfer efficiency. Experimental results show that our work on M-DSP exhibits very competitive performance compared to other methods running on CPUs or GPUs. This demonstrates the efficiency of our method and shows that the vector-SIMD architecture is beneficial for scientific computing with "high-density" characteristics, which can exploit its wide vector cores and achieve higher performance than on competing platforms.
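The weighted-sum update that the abstract describes can be sketched in a few lines of plain Python. This is a scalar illustration of convolution-based gridding in general, not the vectorized M-DSP method: the Gaussian kernel, the `radius` and `sigma` parameters, and the normalization by the sum of weights are assumptions chosen for the example.

```python
import math


def grid_samples(samples, width, height, radius=1.5, sigma=0.7):
    """Accumulate irregular samples onto a regular grid.

    samples: iterable of (x, y, value) given in grid coordinates.
    Each cell receives the kernel-weighted average of the samples within
    `radius` of it; cells with no contributing sample are set to 0.0.
    The Gaussian kernel and its parameters are illustrative assumptions.
    """
    acc = [[0.0] * width for _ in range(height)]
    wsum = [[0.0] * width for _ in range(height)]
    for x, y, value in samples:
        # Visit only the cells inside the kernel's bounding box.
        j_lo = max(0, math.ceil(y - radius))
        j_hi = min(height - 1, math.floor(y + radius))
        i_lo = max(0, math.ceil(x - radius))
        i_hi = min(width - 1, math.floor(x + radius))
        for j in range(j_lo, j_hi + 1):
            for i in range(i_lo, i_hi + 1):
                d2 = (i - x) ** 2 + (j - y) ** 2
                if d2 > radius * radius:
                    continue
                w = math.exp(-d2 / (2.0 * sigma * sigma))
                acc[j][i] += w * value
                wsum[j][i] += w
    return [[acc[j][i] / wsum[j][i] if wsum[j][i] > 0.0 else 0.0
             for i in range(width)] for j in range(height)]


image = grid_samples([(1.0, 1.0, 2.0), (1.2, 0.8, 2.0)], width=3, height=3)
print(image[1][1])
```

The innermost loop over neighboring cells is the part a vector-SIMD implementation would try to execute across many lanes at once.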

Volume 113, Article 102972.

Citations: 0

QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102958

Qingxiao Sun, Liu Yi, Hailong Yang, Mingzhen Li, Zhongzhi Luan, Depei Qian

Although GPUs have become indispensable in data centers, meeting Quality of Service (QoS) requirements under task consolidation on GPUs is extremely challenging. Previous works mostly rely on static task or resource scheduling and cannot handle QoS violations at runtime. In addition, existing works fail to exploit the computing characteristics of batch tasks, and thus waste opportunities to reduce power consumption while improving GPU utilization. To address these problems, we propose a new runtime mechanism, SMQoS, that dynamically adjusts the resource allocation at runtime to meet the QoS of latency-sensitive (LS) tasks and determines the optimal resource allocation for batch tasks to improve GPU utilization and power efficiency. We implement the proposed mechanism on both a simulator (SMQoS) and real GPU hardware (RH-SMQoS). The experimental results show that both SMQoS and RH-SMQoS achieve better QoS for LS tasks and higher throughput for batch tasks than state-of-the-art works. With a hardware extension, SMQoS can further reduce power consumption by power-gating idle computing resources.

Volume 113, Article 102958.

Citations: 1

SVM-SMO-SGD: A hybrid-parallel support vector machine algorithm using sequential minimal optimization with stochastic gradient descent

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102955

Gizen Mutlu, Çiğdem İnan Acı

The Support Vector Machine (SVM) is a popular machine learning algorithm because of its high accuracy. However, like most machine learning algorithms, the time and memory consumption of the SVM algorithm increases linearly as the dataset grows. In this study, a parallel-hybrid algorithm that combines SVM and Sequential Minimal Optimization (SMO) with Stochastic Gradient Descent (SGD) is proposed to optimize the calculation of the weight costs. The performance of the proposed SVM-SMO-SGD algorithm was compared with classical SMO and Compute Unified Device Architecture (CUDA) based approaches on the well-known Diabetes, Healthcare Stroke Prediction, and Adults datasets, with 520, 5110, and 32,560 samples, respectively. According to the results, sequential SVM-SMO-SGD is 3.81 times faster and 1.04 times more efficient in RAM consumption than the classical SMO algorithm. The parallel SVM-SMO-SGD algorithm, on the other hand, is 75.47 times faster than the classical SMO algorithm and 1.9 times more efficient in RAM consumption. The overall classification accuracy of all algorithms is 87% on the Diabetes dataset, 95% on the Healthcare Stroke Prediction dataset, and 82% on the Adults dataset.
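For readers unfamiliar with the SGD ingredient mentioned in the abstract, the following is a minimal sketch of plain stochastic gradient descent on the regularized hinge loss of a linear SVM. It is deliberately not the SVM-SMO-SGD hybrid, whose details the abstract does not give; the function names, the fixed learning rate, and the toy data are assumptions for illustration.

```python
import random


def train_linear_svm_sgd(points, labels, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Plain SGD on lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)).

    A generic textbook-style sketch, not the paper's algorithm.
    labels must be +1 or -1. Returns the weight vector and bias.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            x, y = points[idx], labels[idx]
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Gradient step for the regularizer (shrinks w slightly).
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                # Gradient step for the hinge term of a violating sample.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0 else -1


# Toy, linearly separable data (hypothetical).
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm_sgd(points, labels)
print([predict(w, b, x) for x in points])
```

Each update touches a single sample, so the per-step cost is independent of the dataset size.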

Volume 113, Article 102955.

Citations: 9

Routing brain traffic through the von Neumann bottleneck: Efficient cache usage in spiking neural network simulation code on general purpose computers

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102952

J. Pronold, J. Jordan, B.J.N. Wylie, I. Kitayama, M. Diesmann, S. Kunkel

Simulation is a third pillar, next to experiment and theory, in the study of complex dynamic systems such as biological neural networks. Contemporary brain-scale networks correspond to directed random graphs of a few million nodes, each with an in-degree and out-degree of several thousand edges, where nodes and edges correspond to the fundamental biological units, neurons and synapses, respectively. The activity in neuronal networks is also sparse. Each neuron occasionally transmits a brief signal, called a spike, via its outgoing synapses to the corresponding target neurons. In distributed computing, these targets are scattered across thousands of parallel processes. The spatial and temporal sparsity represents an inherent bottleneck for simulations on conventional computers: irregular memory-access patterns cause poor cache utilization. Using an established neuronal network simulation code as a reference implementation, we investigate how common techniques to recover cache performance, such as software-induced prefetching and software pipelining, can benefit a real-world application. The algorithmic changes reduce simulation time by up to 50%. The study exemplifies that many-core systems assigned an intrinsically parallel computational problem can alleviate the von Neumann bottleneck of conventional computer architectures.

Volume 113, Article 102952. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167819122000461/pdfft?md5=b8e7064aa5b20b2508d68e7bff9b38e4&pid=1-s2.0-S0167819122000461-main.pdf

Citations: 6

Fast calculation of isostatic compensation correction using the GPU-parallel prism method

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102970

Yan Huang, Qingbin Wang, Minghao Lv, Xingguang Song, Jinkai Feng, Xuli Tan, Ziyan Huang, Chuyuan Zhou

Isostatic compensation is a crucial component of crustal structure analysis and geoid calculations in cases of gravity reduction. However, large-scale and high-precision calculations are limited by the inefficiency of the strict prism method and the low accuracy of the approximate calculation formula. In this study, we propose a new method of terrain grid re-encoding and an eight-component strict prism integral disassembly using the Compute Unified Device Architecture (CUDA) parallel programming platform. We develop a fast parallel algorithm for the isostatic compensation correction that uses the strict prism method, based on CPU + GPU heterogeneous parallelization with efficient task allocation and a GPU thread overloading procedure. The results of this study provide a rigorous, fast, and accurate solution for high-resolution and high-precision isostatic compensation corrections. While ensuring an absolute calculation accuracy of 10⁻⁶ mGal, the maximum acceleration ratio of the calculation reaches at least 730 using one GPU and 2241 using four GPUs, which shortens the calculation time and improves the calculation efficiency.

Volume 113, Article 102970. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167819122000618/pdfft?md5=c2b82b5c153d0daba6ac23f42fb2b152&pid=1-s2.0-S0167819122000618-main.pdf

Citations: 0

parGeMSLR: A parallel multilevel Schur complement low-rank preconditioning and solution package for general sparse matrices

IF 1.4; Zone 4, Computer Science; Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-10-01 DOI: 10.1016/j.parco.2022.102956

Tianshi Xu, Vassilis Kalantzis, Ruipeng Li, Yuanzhe Xi, Geoffrey Dillon, Yousef Saad

This paper discusses parGeMSLR, a C++/MPI software library for the solution of sparse systems of linear algebraic equations via preconditioned Krylov subspace methods in distributed-memory computing environments. The preconditioner implemented in parGeMSLR is based on algebraic domain decomposition and partitions the symmetrized adjacency graph recursively into several non-overlapping partitions via a $p$-way vertex separator, where $p$ is an integer multiple of the total number of MPI processes. From a numerical perspective, parGeMSLR builds a Schur complement approximate inverse preconditioner as the sum between the matrix inverse of the interface coupling matrix and a low-rank correction term. To reduce the cost associated with the computation of the approximate inverse matrices, parGeMSLR exploits a multilevel partitioning of the algebraic domain. The parGeMSLR library is implemented on top of the Message Passing Interface and can solve both real and complex linear systems. Furthermore, parGeMSLR can take advantage of hybrid computing environments with in-node access to one or more Graphics Processing Units. Finally, the parallel efficiency (weak and strong scaling) of parGeMSLR is demonstrated on a few model problems arising from discretizations of 3D Partial Differential Equations.
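The "inverse plus low-rank correction" structure mentioned in the abstract can be written out with standard block notation. The symbols below are assumptions introduced here, since the abstract does not define any: the reordered matrix is split into an interior block $B$, an interface coupling block $C$, and off-diagonal blocks $E$ and $F$.

$$
A = \begin{pmatrix} B & F \\ E & C \end{pmatrix}, \qquad
S = C - E B^{-1} F, \qquad
G = E B^{-1} F C^{-1}
$$

$$
S = (I - G)\,C
\quad\Longrightarrow\quad
S^{-1} = C^{-1}(I - G)^{-1} = C^{-1} + C^{-1}\left[(I - G)^{-1} - I\right]
$$

Under these assumptions, the first term is the inverse of the interface coupling matrix and the bracketed term is the part that a low-rank approximation would replace; how parGeMSLR computes that approximation is not stated in the abstract.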

Volume 113, Article 102956. https://doi.org/10.1016/j.parco.2022.102956

Citations: 0

Characterizing the Performance of Node-Aware Strategies for Irregular Point-to-Point Communication on Heterogeneous Architectures

IF 1.4 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-09-13 DOI: 10.48550/arXiv.2209.06141

S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson

Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI on heterogeneous compute environments through performance modeling, demonstrating the limitations of standard communication strategies for both device-aware and staging-through-host communication techniques. The presented models suggest staging communicated data through host processes and then using node-aware communication strategies when inter-node message counts are high. Notably, the models also predict that node-aware communication that uses all available CPU cores to communicate inter-node data is the most performant strategy when communicating with a large number of nodes. Model validation is provided via a case study of irregular point-to-point communication patterns in distributed sparse matrix-vector products. We also discuss the implications of the model predictions for communication strategy design on emerging supercomputer architectures.
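As a rough illustration of why aggregating messages on-node can pay off at high inter-node message counts, the sketch below compares two strategies under a simple postal (latency plus bandwidth) cost model. It is not the performance model from the paper: the function names and every parameter value are hypothetical and chosen only to make the trade-off visible.

```python
# Hypothetical postal-model parameters; none of these values come from the paper.
ALPHA_INTER = 2.0e-6   # assumed per-message latency between nodes (seconds)
BETA_INTER = 1.0e-10   # assumed per-byte cost between nodes (seconds/byte)
ALPHA_INTRA = 5.0e-7   # assumed per-message latency within a node (seconds)
BETA_INTRA = 2.0e-11   # assumed per-byte cost within a node (seconds/byte)


def postal_cost(n_msgs, n_bytes, alpha, beta):
    """Postal model: one latency term per message plus one bandwidth term per byte."""
    return alpha * n_msgs + beta * n_bytes


def standard_cost(n_msgs, msg_bytes):
    """Standard strategy: every message crosses the network on its own."""
    return postal_cost(n_msgs, n_msgs * msg_bytes, ALPHA_INTER, BETA_INTER)


def node_aware_cost(n_msgs, msg_bytes, n_dest_nodes):
    """Node-aware strategy: gather on-node, send one aggregated message
    per destination node, then redistribute on the receiving node."""
    total_bytes = n_msgs * msg_bytes
    gather = postal_cost(n_msgs, total_bytes, ALPHA_INTRA, BETA_INTRA)
    inter = postal_cost(n_dest_nodes, total_bytes, ALPHA_INTER, BETA_INTER)
    scatter = postal_cost(n_msgs, total_bytes, ALPHA_INTRA, BETA_INTRA)
    return gather + inter + scatter


if __name__ == "__main__":
    # Many small messages to a few nodes: aggregation removes most inter-node latency terms.
    print(standard_cost(1000, 64), node_aware_cost(1000, 64, 4))
    # A single message: the extra on-node steps are pure overhead.
    print(standard_cost(1, 64), node_aware_cost(1, 64, 1))
```

With these assumed parameters, the node-aware variant is cheaper for many small messages and more expensive for a single message, which matches the qualitative claim in the abstract that the best strategy depends on the message count.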



Citations: 2

Energy-efficient scheduling algorithms based on task clustering in heterogeneous spark clusters

IF 1.4 Zone 4 (Computer Science) Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-09-01 DOI: 10.1016/j.parco.2022.102947

Wenhu Shi, Hongjian Li, Junzhe Guan, Hang Zeng, Rafe Misskat jahan

Spark is widely used for its fast in-memory processing, and improving its energy efficiency under deadline constraints is important. In this paper, a Task Performance Clustering of Best Fitting Decrease (TPCBFD) scheduling algorithm is proposed. It divides tasks in Spark into three types, with the different types of tasks being placed on nodes with superior performance. However, the basic computation time for TPCBFD takes up a large proportion of the task execution time, so an Energy-Aware TPCBFD (EATPCBFD) algorithm, based on the proposed energy consumption model, is introduced to optimize energy efficiency and Service Level Agreement (SLA) service times. The experimental results show that EATPCBFD increases the average energy efficiency in Spark by 77% and the average passing rate of SLA service time by 14% compared to the comparison algorithms. Under deadline constraints, the average energy efficiency of EATPCBFD is also higher than that of the comparison algorithms.
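The algorithm name refers to best-fit decreasing placement. The abstract does not give the details of TPCBFD, so the following is only a minimal sketch of the classic best-fit-decreasing heuristic, assuming each task has a single scalar load and each node a single scalar capacity. The function name `best_fit_decreasing` and its inputs are hypothetical and are not the authors' implementation.

```python
def best_fit_decreasing(task_loads, node_capacities):
    """Place tasks on nodes with the classic best-fit-decreasing heuristic.

    Returns a list of (task_index, node_index) pairs in placement order;
    node_index is None when no node has enough remaining capacity.
    """
    remaining = list(node_capacities)
    placement = []
    # Consider the largest tasks first (ties keep their original order).
    order = sorted(range(len(task_loads)), key=lambda i: task_loads[i], reverse=True)
    for i in order:
        load = task_loads[i]
        # Nodes that can still hold this task.
        candidates = [n for n, free in enumerate(remaining) if free >= load]
        if not candidates:
            placement.append((i, None))
            continue
        # Best fit: the node that would be left with the least spare capacity.
        best = min(candidates, key=lambda n: remaining[n] - load)
        remaining[best] -= load
        placement.append((i, best))
    return placement


if __name__ == "__main__":
    print(best_fit_decreasing([4, 8, 1, 4, 2, 1], [10, 10]))
```

Sorting in decreasing order places the hardest-to-fit tasks while nodes are still mostly empty, and choosing the tightest fit keeps larger gaps free for later tasks. An energy-aware variant would presumably also weigh per-node power draw, which this sketch does not model.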



Citations: 1