
Latest publications in Parallel Computing

SVM-SMO-SGD: A hybrid-parallel support vector machine algorithm using sequential minimal optimization with stochastic gradient descent
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-10-01 | DOI: 10.1016/j.parco.2022.102955
Gizen Mutlu, Çiğdem İnan Acı

The Support Vector Machine (SVM) method is one of the most popular machine learning algorithms as it gives high accuracy. However, like most machine learning algorithms, the resource consumption of the SVM algorithm in terms of time and memory increases linearly as the dataset grows. In this study, a parallel-hybrid algorithm that combines SVM and Sequential Minimal Optimization (SMO) with Stochastic Gradient Descent (SGD) has been proposed to optimize the calculation of the weight costs. The performance of the proposed SVM-SMO-SGD algorithm was compared with classical SMO and Compute Unified Device Architecture (CUDA)-based approaches on well-known datasets (i.e., Diabetes, Healthcare Stroke Prediction, Adults) with 520, 5110, and 32,560 samples, respectively. According to the results, sequential SVM-SMO-SGD is 3.81 times faster and 1.04 times more efficient in RAM consumption than the classical SMO algorithm. The parallel SVM-SMO-SGD algorithm, in turn, is 75.47 times faster than the classical SMO algorithm and 1.9 times more efficient in RAM consumption. The overall classification accuracy of all algorithms is 87% on the Diabetes dataset, 95% on the Healthcare Stroke Prediction dataset, and 82% on the Adults dataset.
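To make the optimization concrete: the core of the hybrid approach is replacing exact weight computation with stochastic gradient steps on the SVM objective. A minimal sketch of hinge-loss SGD in Python follows (a Pegasos-style update on toy data of our own; an illustration of the general technique, not the authors' SVM-SMO-SGD code):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=10, rng=None):
    """SGD on the primal SVM objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))), with y in {-1, +1}."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying Pegasos step size
            if y[i] * (X[i] @ w + b) < 1:         # sample violates the margin
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:                                 # only the regularizer acts
                w = (1 - eta * lam) * w
    return w, b

# toy usage on linearly separable data
X = np.array([[2., 2.], [1., 3.], [-2., -1.], [-1., -3.]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd(X, y)
print(np.sign(X @ w + b))  # recovers y
```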

Citations: 9
Routing brain traffic through the von Neumann bottleneck: Efficient cache usage in spiking neural network simulation code on general purpose computers
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-10-01 | DOI: 10.1016/j.parco.2022.102952
J. Pronold , J. Jordan , B.J.N. Wylie , I. Kitayama , M. Diesmann , S. Kunkel

Simulation is a third pillar next to experiment and theory in the study of complex dynamic systems such as biological neural networks. Contemporary brain-scale networks correspond to directed random graphs of a few million nodes, each with an in-degree and out-degree of several thousand edges, where nodes and edges correspond to the fundamental biological units, neurons and synapses, respectively. The activity in neuronal networks is also sparse: each neuron occasionally transmits a brief signal, called a spike, via its outgoing synapses to the corresponding target neurons. In distributed computing these targets are scattered across thousands of parallel processes. The spatial and temporal sparsity represents an inherent bottleneck for simulations on conventional computers: irregular memory-access patterns cause poor cache utilization. Using an established neuronal network simulation code as a reference implementation, we investigate how common techniques to recover cache performance, such as software-induced prefetching and software pipelining, can benefit a real-world application. The algorithmic changes reduce simulation time by up to 50%. The study exemplifies that many-core systems assigned an intrinsically parallel computational problem can alleviate the von Neumann bottleneck of conventional computer architectures.
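The two cache techniques named here are loop-level transformations. Python cannot express prefetch hints, but the loop restructuring behind software pipelining can be sketched: the datum needed by iteration i is fetched during iteration i−1, so in a compiled implementation the load overlaps the previous compute (a schematic of the pattern only, not the authors' simulation code):

```python
def process_naive(indices, payload, accumulate):
    # Each iteration stalls on an irregular load before compute starts.
    for idx in indices:
        accumulate(payload[idx])

def process_pipelined(indices, payload, accumulate):
    # Software-pipelined form: fetch the next element one iteration early;
    # in C this is where a __builtin_prefetch hint would go.
    if len(indices) == 0:
        return
    nxt = payload[indices[0]]
    for i in range(1, len(indices)):
        cur, nxt = nxt, payload[indices[i]]
        accumulate(cur)
    accumulate(nxt)
```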

Citations: 6
Fast calculation of isostatic compensation correction using the GPU-parallel prism method
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-10-01 | DOI: 10.1016/j.parco.2022.102970
Yan Huang , Qingbin Wang , Minghao Lv , Xingguang Song , Jinkai Feng , Xuli Tan , Ziyan Huang , Chuyuan Zhou

Isostatic compensation is a crucial component of crustal structure analysis and geoid calculations in cases of gravity reduction. However, large-scale and high-precision calculations are limited by the inefficiency of the strict prism method and the low accuracy of the approximate calculation formula. In this study, we propose a new method of terrain grid re-encoding and an eight-component disassembly of the strict prism integral using a Compute Unified Device Architecture (CUDA) parallel programming platform. We use a fast parallel algorithm for the isostatic compensation correction, applying the strict prism method with CPU + GPU heterogeneous parallelization, efficient task allocation, and a GPU thread-overloading procedure. The results of this study provide a rigorous, fast, and accurate solution for high-resolution and high-precision isostatic compensation corrections. While maintaining an absolute calculation accuracy of 10⁻⁶ mGal, the maximum acceleration ratio reaches at least 730 using one GPU and 2241 using four GPUs, which shortens the calculation time and improves the calculation efficiency.
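For orientation, the "eight-component disassembly" of the strict prism integral matches the structure of the classical closed form, which evaluates one kernel at the eight prism corners with alternating signs. A NumPy sketch follows, using the textbook Nagy (1966) formula as given in Blakely's Potential Theory; it illustrates the corner-sum structure only and is not the authors' CUDA kernel:

```python
import numpy as np

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2

def prism_gz(x1, x2, y1, y2, z1, z2, rho):
    """Vertical attraction of a right rectangular prism observed at the
    origin (z positive down), via the 8-corner closed form."""
    gz = 0.0
    for x, sx in ((x1, -1.0), (x2, 1.0)):
        for y, sy in ((y1, -1.0), (y2, 1.0)):
            for z, sz in ((z1, -1.0), (z2, 1.0)):
                r = np.sqrt(x * x + y * y + z * z)
                kernel = (z * np.arctan2(x * y, z * r)
                          - x * np.log(r + y)
                          - y * np.log(r + x))
                gz += sx * sy * sz * kernel       # alternating corner signs
    return G * rho * gz

# 100 m cube of density 2670 kg/m^3, top face 100 m below the station
print(prism_gz(-50, 50, -50, 50, 100, 200, 2670.0) * 1e5, "mGal")
```

Parallelizing this is then a matter of mapping one terrain prism (or one corner component) per GPU thread and reducing the partial sums.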

Citations: 0
parGeMSLR: A parallel multilevel Schur complement low-rank preconditioning and solution package for general sparse matrices
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-10-01 | DOI: 10.1016/j.parco.2022.102956
Tianshi Xu , Vassilis Kalantzis , Ruipeng Li , Yuanzhe Xi , Geoffrey Dillon , Yousef Saad

This paper discusses parGeMSLR, a C++/MPI software library for the solution of sparse systems of linear algebraic equations via preconditioned Krylov subspace methods in distributed-memory computing environments. The preconditioner implemented in parGeMSLR is based on algebraic domain decomposition and partitions the symmetrized adjacency graph recursively into several non-overlapping partitions via a p-way vertex separator, where p is an integer multiple of the total number of MPI processes. From a numerical perspective, parGeMSLR builds a Schur complement approximate inverse preconditioner as the sum of the matrix inverse of the interface coupling matrix and a low-rank correction term. To reduce the cost associated with the computation of the approximate inverse matrices, parGeMSLR exploits a multilevel partitioning of the algebraic domain. The parGeMSLR library is implemented on top of the Message Passing Interface and can solve both real and complex linear systems. Furthermore, parGeMSLR can take advantage of hybrid computing environments with in-node access to one or more Graphics Processing Units. Finally, the parallel efficiency (weak and strong scaling) of parGeMSLR is demonstrated on a few model problems arising from discretizations of 3D Partial Differential Equations.
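For readers less familiar with the algebra: given the interior/interface block splitting, the standard factorization that motivates Schur-complement preconditioning is (a generic identity, not parGeMSLR's exact multilevel formulation):

```latex
A = \begin{pmatrix} B & F \\ E & C \end{pmatrix}
  = \begin{pmatrix} I & 0 \\ E B^{-1} & I \end{pmatrix}
    \begin{pmatrix} B & F \\ 0 & S \end{pmatrix},
\qquad S = C - E B^{-1} F,
```

so solving with A reduces to solves with B and S. Rearranging S = C − E B⁻¹ F gives

```latex
S^{-1} = C^{-1} + C^{-1} E B^{-1} F S^{-1},
```

and approximating the second term by a low-rank matrix yields exactly the "inverse of the interface coupling matrix plus a low-rank correction" structure described above.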

Citations: 0
Characterizing the Performance of Node-Aware Strategies for Irregular Point-to-Point Communication on Heterogeneous Architectures
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-09-13 | DOI: 10.48550/arXiv.2209.06141
S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson
Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI on heterogeneous compute environments through performance modeling, demonstrating the limitations of standard communication strategies for both device-aware and staging-through-host communication techniques. The presented models suggest staging communicated data through host processes and then using node-aware communication strategies for high inter-node message counts. Notably, the models also predict that node-aware communication utilizing all available CPU cores to communicate inter-node data is the most performant strategy when communicating with a high number of nodes. Model validation is provided via a case study of irregular point-to-point communication patterns in distributed sparse matrix-vector products. Importantly, we include a discussion of the implications model predictions have for communication strategy design on emerging supercomputer architectures.
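The staging-through-host idea can be made concrete with mpi4py: split the world communicator by shared-memory node, aggregate on-node data, and let designated processes carry the inter-node traffic. The sketch below shows the simplest single-leader variant (our illustration; the models above actually favor spreading inter-node traffic across all CPU cores at high node counts):

```python
# run with: mpiexec -n <p> python node_aware.py
from mpi4py import MPI

world = MPI.COMM_WORLD
# ranks sharing a memory domain (a node) land in the same subcommunicator
node = world.Split_type(MPI.COMM_TYPE_SHARED)

# 1. stage each rank's outgoing payload onto the on-node leader (local rank 0)
payload = {"src": world.Get_rank(), "data": [world.Get_rank()] * 4}
staged = node.gather(payload, root=0)

if node.Get_rank() == 0:
    # 2. only leaders join the inter-node communicator: one aggregated
    #    message per node pair instead of one per rank pair
    leaders = world.Split(color=0, key=world.Get_rank())
    remote = leaders.alltoall([staged] * leaders.Get_size())
    # 3. scattering the received data back to on-node ranks would follow here
else:
    world.Split(color=MPI.UNDEFINED, key=0)  # non-leaders opt out collectively
```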
Citations: 2
Energy-efficient scheduling algorithms based on task clustering in heterogeneous spark clusters
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102947
Wenhu Shi, Hongjian Li, Junzhe Guan, Hang Zeng, Rafe Misskat jahan

Spark is widely used for its fast in-memory processing, and it is important to improve its energy efficiency under deadline constraints. In this paper, a Task Performance Clustering of Best Fitting Decrease (TPCBFD) scheduling algorithm is proposed. It divides tasks in Spark into three types, with the different types of tasks placed on nodes with superior performance. However, the basic computation time for TPCBFD takes up a large proportion of the task execution time, so the Energy-Aware TPCBFD (EATPCBFD) algorithm, based on the proposed energy consumption model, is introduced to optimize energy efficiency and Service Level Agreement (SLA) service times. The experimental results show that EATPCBFD increases the average energy efficiency in Spark by 77% and the average passing rate of SLA service time by 14% compared to the comparison algorithms, and achieves higher average energy efficiency under deadline constraints.
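The abstract does not spell out the placement rule, but a best-fitting-decrease scheduler typically sorts tasks by decreasing cost and assigns each to the node whose projected finish time grows least. A minimal sketch under that reading (node speeds, costs, and tie-breaking are our own assumptions, not the paper's TPCBFD internals):

```python
def best_fit_decrease(task_costs, node_speeds):
    """Place tasks (largest first) on the node with the smallest
    projected finish time after accepting the task."""
    finish = [0.0] * len(node_speeds)
    placement = {}
    for tid, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        projected = [finish[n] + cost / node_speeds[n]
                     for n in range(len(node_speeds))]
        best = min(range(len(node_speeds)), key=projected.__getitem__)
        finish[best] = projected[best]
        placement[tid] = best
    return placement, max(finish)  # assignment and makespan

# three heterogeneous nodes (relative speeds) and ten tasks
placement, makespan = best_fit_decrease(
    [9, 7, 7, 5, 4, 3, 3, 2, 2, 1], node_speeds=[2.0, 1.0, 1.0])
print(placement, makespan)
```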

Citations: 1
Resource allocation for task-level speculative scientific applications: A proof of concept using Parallel Trajectory Splicing
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102936
Andrew Garmon , Vinay Ramakrishnaiah , Danny Perez

The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy to improve scalability is to express algorithms in terms of independent tasks that can be executed concurrently on a runtime system. In this manuscript, we consider a generalization of this approach in which task-level speculation is allowed. In this context, each task is assigned a probability corresponding to the likelihood that its output will be consumed as part of the larger calculation. We consider the problem of optimally allocating resources to each of the possible tasks so as to maximize the total expected computational throughput. The power of this approach is demonstrated by analyzing its application to Parallel Trajectory Splicing, a massively-parallel long-time-dynamics method for atomistic simulations.
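The allocation problem has a clean greedy form when each task's throughput is concave in the resources it receives: repeatedly hand the next worker to the task with the largest marginal gain in probability-weighted throughput. A sketch (the concave speedup curve and probabilities are illustrative stand-ins, not the paper's model):

```python
import math

def allocate(probs, speedup, total_workers):
    """probs[i]: probability task i's output is consumed.
    speedup(n): expected task throughput with n workers (assumed concave,
    which makes this greedy allocation optimal for the separable objective
    sum_i probs[i] * speedup(n_i) subject to sum_i n_i = total_workers)."""
    alloc = [0] * len(probs)
    for _ in range(total_workers):
        gains = [p * (speedup(n + 1) - speedup(n))
                 for p, n in zip(probs, alloc)]
        alloc[gains.index(max(gains))] += 1   # largest marginal gain wins
    return alloc

# a certain task, a coin-flip speculation, and a long shot share 8 workers
print(allocate([1.0, 0.5, 0.1], speedup=lambda n: math.log1p(n),
               total_workers=8))
```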

Citations: 1
Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102944
Lena Oden, Jörg Keller

We investigate cryptanalytic applications composed of many independent tasks that exhibit a stochastic runtime distribution. We compare four algorithms for executing such applications on GPUs and on multicore CPUs with SIMD units. We demonstrate that for four different distributions, multiple problem sizes, and three platforms, the best strategy varies. We support our analytic results by extensive experiments on an Intel Skylake-based multicore CPU and a high-performance GPU (NVIDIA Volta).
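Why the strategy choice depends on the runtime distribution is easy to see in a toy Monte Carlo: under heavy-tailed runtimes, a static assignment is dominated by stragglers while a dynamic shared queue absorbs them (an illustration of the effect, not the paper's four algorithms):

```python
import heapq, random

def makespan_static(runtimes, workers):
    loads = [0.0] * workers
    for i, t in enumerate(runtimes):      # fixed round-robin assignment
        loads[i % workers] += t
    return max(loads)

def makespan_dynamic(runtimes, workers):
    busy = [0.0] * workers                # each worker pulls the next task
    heapq.heapify(busy)                   # as soon as it becomes free
    for t in runtimes:
        heapq.heappush(busy, heapq.heappop(busy) + t)
    return max(busy)

random.seed(1)
tasks = [random.lognormvariate(0, 1.5) for _ in range(4096)]
print(makespan_static(tasks, 64), makespan_dynamic(tasks, 64))
```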

Citations: 0
Optimizing convolutional neural networks on multi-core vector accelerator
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102945
Zhong Liu, Xin Xiao, Chen Li, Sheng Ma, Deng Rangyu

Vector accelerators have been widely used in scientific computing and show great potential for accelerating the computational performance of convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduce a large amount of intermediate data and additional conversions, and the resulting memory overhead causes a great performance loss.

To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, including: 1) a data layout method that establishes a set of efficient data storage and computing models for various CNN networks on vector accelerators, achieving high memory-access and vectorization efficiency; and 2) a conversion method that turns the computation of convolutional and fully connected layers into large-scale matrix multiplication, and the computation of pooling layers into row computations on matrices. All conversions are implemented by extracting rows from a two-dimensional matrix, with high data access and transmission efficiency and without additional memory overhead or data conversion.

Based on these methods, we design a mechanism to vectorize convolutional, pooling, and fully connected layers on a vector accelerator, applicable to various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. The experimental results show that the average computational efficiency of the convolutional and fully connected layers of AlexNet, VGG-19, GoogleNet, and ResNet-50 is 93.3% and 93.4%, respectively, and the average data-access efficiency of the pooling layers is 70%. Compared to NVIDIA inference GPUs, our accelerator achieves a 36.1% performance improvement, comparable to NVIDIA V100 GPUs; compared with the Matrix2000, which has a similar architecture, it achieves a 17-45% improvement in computational efficiency.
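The convolution-to-GEMM conversion in point 2) follows the well-known im2col pattern: gather each receptive field into a column, then issue one large matrix multiplication. A NumPy sketch of that standard trick (illustrative; the paper's row-extraction layout for the vector accelerator is more elaborate):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """x: (C, H, W) input; w: (K, C, R, S) filters; stride 1, no padding."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    # gather every (C*R*S)-element receptive field as one column
    cols = np.empty((C * R * S, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i + R, j:j + S].ravel()
    out = w.reshape(K, -1) @ cols         # the single large GEMM
    return out.reshape(K, OH, OW)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv2d_as_gemm(x, w).shape)         # (4, 6, 6)
```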

Citations: 3
Performance and accuracy predictions of approximation methods for shortest-path algorithms on GPUs
IF 1.4 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102942
Busenur Aktılav, Işıl Öz

Approximate computing techniques, in which less-than-perfect solutions are acceptable, offer performance-accuracy trade-offs by performing inexact computations. Moreover, heterogeneous architectures, which combine miscellaneous compute units, offer high performance as well as energy efficiency. Graph algorithms benefit from the parallel compute units of heterogeneous GPU architectures as well as from the performance improvements offered by approximation methods. Since different approximations yield different speedups and accuracy losses for the target execution, it becomes impractical to test all methods with various parameters. In this work, we perform approximate computations for three shortest-path graph algorithms and propose a machine learning framework to predict the impact of the approximations on program performance and output accuracy. We evaluate predictions for both synthetic random graphs and real road-network graphs, as well as predictions of large graph cases from small graph instances. We achieve prediction error rates below 5% for speedup and inaccuracy values.
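The prediction setup can be pictured as plain supervised regression: featurize the approximation configuration and graph statistics, train on measured runs, and query unseen configurations. A sketch with synthetic stand-in data (feature choice, model, and numbers are ours, not the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# features: [approximation ratio, average degree, node count]
X = rng.uniform([0.1, 2, 1e3], [0.9, 16, 1e6], size=(200, 3))
speedup = 1 + 4 * X[:, 0] + rng.normal(0, 0.1, 200)  # fake measured labels

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, speedup)
print(model.predict([[0.5, 8, 5e5]]))  # predicted speedup for a new config
```

An identical regressor trained on accuracy-loss labels would give the second half of the trade-off.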

Citations: 0