Resource allocation for task-level speculative scientific applications: A proof of concept using Parallel Trajectory Splicing
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102936
Andrew Garmon, Vinay Ramakrishnaiah, Danny Perez
The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy for improving scalability is to express algorithms in terms of independent tasks that can be executed concurrently on a runtime system. In this manuscript, we consider a generalization of this approach in which task-level speculation is allowed. In this context, a probability is attached to each task, corresponding to the likelihood that the output of the speculative task will be consumed as part of the larger calculation. We consider the problem of optimally allocating resources to each of the possible tasks so as to maximize the total expected computational throughput. The power of this approach is demonstrated by analyzing its application to Parallel Trajectory Splicing, a massively parallel long-time-dynamics method for atomistic simulations.
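To make the allocation problem concrete, here is a minimal Python sketch, assuming a concave Amdahl-style speedup model and a simple greedy scheme; the model, its parameters, and all function names are illustrative assumptions of mine, not the paper's actual formulation. With a concave per-task speedup, repeatedly granting the next core to the task with the largest marginal expected gain maximizes the total expected throughput.

```python
# Hypothetical sketch: greedily assign cores to speculative tasks to maximize
# total *expected* throughput. Assumes a concave per-task speedup model, for
# which this greedy incremental allocation is optimal.
import heapq

def speedup(cores, frac_parallel=0.95):
    # Amdahl-style speedup; frac_parallel is an assumed model parameter.
    if cores == 0:
        return 0.0
    return 1.0 / ((1.0 - frac_parallel) + frac_parallel / cores)

def allocate(probabilities, total_cores):
    """probabilities[i] = likelihood that task i's output is consumed."""
    alloc = [0] * len(probabilities)
    # Max-heap keyed on the marginal expected gain of granting one more core.
    heap = [(-p * speedup(1), i) for i, p in enumerate(probabilities)]
    heapq.heapify(heap)
    for _ in range(total_cores):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        gain = probabilities[i] * (speedup(alloc[i] + 1) - speedup(alloc[i]))
        heapq.heappush(heap, (-gain, i))
    return alloc

# Likely-to-be-consumed tasks receive more cores than long-shot speculation.
print(allocate([0.9, 0.5, 0.1], total_cores=16))
```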
{"title":"Resource allocation for task-level speculative scientific applications: A proof of concept using Parallel Trajectory Splicing","authors":"Andrew Garmon , Vinay Ramakrishnaiah , Danny Perez","doi":"10.1016/j.parco.2022.102936","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102936","url":null,"abstract":"<div><p><span>The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy to improve scalability is to express algorithms in terms of independent tasks that can be executed concurrently on a </span>runtime system<span>. In this manuscript, we consider a generalization of this approach where task-level speculation is allowed. In this context, a probability is attached to each task which corresponds to the likelihood that the output of the speculative task will be consumed as part of the larger calculation. We consider the problem of optimal resource allocation to each of the possible tasks so as to maximize the total expected computational throughput. The power of this approach is demonstrated by analyzing its application to Parallel Trajectory Splicing, a massively-parallel long-time-dynamics method for atomistic simulations.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102936"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91714606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102944
Lena Oden, Jörg Keller
We investigate cryptanalytic applications composed of many independent tasks whose runtimes follow a stochastic distribution. We compare four algorithms for executing such applications on GPUs and on multicore CPUs with SIMD units. We demonstrate that the best strategy varies across four different distributions, multiple problem sizes, and three platforms. We support our analytic results with extensive experiments on an Intel Skylake-based multicore CPU and a high-performance GPU (NVIDIA Volta).
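The core tension can be illustrated with a small Monte Carlo experiment (my own sketch, not the authors' code): on SIMD or GPU hardware, a group of lanes executes in lockstep, so a batch of tasks costs the maximum of its runtimes, and long-tailed runtime distributions erode efficiency as the group widens. The exponential distribution below is an assumed example.

```python
# Illustrative sketch: estimate SIMD efficiency when `width` tasks with
# random runtimes run in lockstep, so the group waits for the slowest lane.
import random

def simd_efficiency(width, n_batches=10000):
    useful, occupied = 0.0, 0.0
    for _ in range(n_batches):
        runtimes = [random.expovariate(1.0) for _ in range(width)]
        useful += sum(runtimes)            # work actually needed
        occupied += width * max(runtimes)  # lanes held until the slowest finishes
    return useful / occupied

for w in (1, 4, 32):
    print(f"SIMD width {w:2d}: efficiency ~ {simd_efficiency(w):.2f}")
```

For exponentially distributed runtimes the efficiency is 1/H_w (the w-th harmonic number), so wider lockstep groups pay a growing straggler penalty; this is one reason the best execution strategy varies with the distribution.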
{"title":"Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores","authors":"Lena Oden, Jörg Keller","doi":"10.1016/j.parco.2022.102944","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102944","url":null,"abstract":"<div><p>We investigate cryptanalytic applications comprised of many independent tasks that exhibit a stochastic runtime distribution. We compare four algorithms for executing such applications on GPUs and on multicore CPUs with SIMD units. We demonstrate that for four different distributions, multiple problem sizes, and three platforms the best strategy varies. We support our analytic results by extensive experiments on an Intel Skylake-based multicore CPU and a high performance GPU (Nvidia Volta).</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102944"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"137214689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing convolutional neural networks on multi-core vector accelerator
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102945
Zhong Liu, Xin Xiao, Chen Li, Sheng Ma, Deng Rangyu
Vector accelerators have been widely used in scientific computing, and they also show great potential for accelerating the computational performance of convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduce a large amount of intermediate data and additional conversions, and the resulting memory overhead causes significant performance loss.
To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, comprising: 1) a data layout method that establishes a set of efficient data storage and computing models for various CNN networks on vector accelerators, achieving high memory-access efficiency and high vectorization efficiency; and 2) a conversion method that transforms the computation of convolutional and fully connected layers into large-scale matrix multiplication, and the computation of pooling layers into row-wise matrix computations. All conversions are implemented by extracting rows from a two-dimensional matrix, with high data access and transmission efficiency, and without additional memory overhead or data conversion.
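The conversion of a convolutional layer into one large matrix multiplication follows the general unfolding idea sketched below (a generic im2row-style illustration, not the authors' exact data layout; all names are mine):

```python
# Minimal im2row-style sketch (generic unfolding technique, not the authors'
# exact layout): a convolution becomes one large matrix multiplication by
# extracting input patches as matrix rows.
import numpy as np

def conv2d_as_matmul(x, w):
    """x: (H, W, Cin) input; w: (K, K, Cin, Cout) filters; stride 1, no padding."""
    H, W, Cin = x.shape
    K, _, _, Cout = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # Unfold each KxKxCin patch into one row -> a (Ho*Wo, K*K*Cin) matrix.
    rows = np.stack([x[i:i + K, j:j + K, :].ravel()
                     for i in range(Ho) for j in range(Wo)])
    out = rows @ w.reshape(K * K * Cin, Cout)  # the single large matmul
    return out.reshape(Ho, Wo, Cout)

x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 16)
print(conv2d_as_matmul(x, w).shape)  # (6, 6, 16)
```

Pooling layers can likewise be phrased as row-wise reductions over such an unfolded matrix.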
Based on these methods, we design a vectorization mechanism to vectorize the convolutional, pooling, and fully connected layers on a vector accelerator, applicable to various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. The experimental results show that the average computational efficiency of the convolutional and fully connected layers of AlexNet, VGG-19, GoogLeNet, and ResNet-50 is 93.3% and 93.4%, respectively, and the average data-access efficiency of the pooling layers is 70%. Our accelerator achieves a 36.1% performance improvement over NVIDIA inference GPUs and is comparable to the NVIDIA V100 GPU. Compared with the Matrix2000, which has a similar architecture, our accelerator achieves a 17%–45% improvement in computational efficiency.
{"title":"Optimizing convolutional neural networks on multi-core vector accelerator","authors":"Zhong Liu, Xin Xiao, Chen Li, Sheng Ma, Deng Rangyu","doi":"10.1016/j.parco.2022.102945","DOIUrl":"10.1016/j.parco.2022.102945","url":null,"abstract":"<div><p>Vector Accelerators have been widely used in scientific computing. It also shows great potential to accelerate the computational performance of convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduced a large amount of intermediate data and additional conversion, and the resulting memory overhead would cause great performance loss.</p><p>To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, including: 1) Data layout method: establishing a set of efficient data storage and computing models for various CNN networks on vector accelerators. It achieves high memory access efficiency and high vectorization efficiency. 2) A conversion method: convert the computation of convolutional layers and fully connected layers into large-scale matrix multiplication, and convert the computation of pooling layers into row computation of matrix. All conversions are implemented by extracting rows from a two-dimensional matrix, with high data access and transmission efficiency, and without additional memory overhead and data conversion.</p><p>Based on these methods, we design a vectorization mechanism to vectorize convolutional, pooling and fully connected layers on a vector accelerator, which can be applied for various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. The experimental results show that the average computational efficiency of the convolutional layers and full connected layers of AlexNet, VGG-19, GoogleNet and ResNet-50 is 93.3% and 93.4% respectively, and the average data access efficiency of pooling layer is 70%. Compared to NVIDIA inference GPUs, our accelerator achieves a 36.1% performance improvement, comparable to NVIDIA V100 GPUs. Compared with Matrix2000 of similar architecture, our accelerator achieves a 17-45% improvement in computational efficiency.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102945"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83744906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance and accuracy predictions of approximation methods for shortest-path algorithms on GPUs
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102942
Busenur Aktılav, Işıl Öz
Approximate computing techniques, in which less-than-perfect solutions are acceptable, offer performance-accuracy trade-offs by performing inexact computations. Moreover, heterogeneous architectures, which combine miscellaneous compute units, offer high performance as well as energy efficiency. Graph algorithms benefit from the parallel compute units of heterogeneous GPU architectures as well as from the performance improvements offered by approximation methods. Since different approximations yield different speedups and accuracy losses for the target execution, it becomes impractical to test all methods with various parameters. In this work, we perform approximate computations for three shortest-path graph algorithms and propose a machine learning framework to predict the impact of the approximations on program performance and output accuracy. We evaluate predictions for both random synthetic graphs and real road-network graphs, as well as predictions for large-graph cases made from small graph instances. We achieve prediction error rates below 5% for both speedup and inaccuracy values.
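The prediction step can be pictured as a small multi-output regression, sketched below under assumed features, training data, and model choice (the authors' actual feature set and learner are not reproduced here):

```python
# Hedged sketch of feature-based prediction; features, data, and the model
# are my assumptions, not the paper's framework.
from sklearn.ensemble import RandomForestRegressor

# Hypothetical features: [num_nodes, num_edges, avg_degree, approx_level]
X_train = [[1e3, 4e3, 4.0, 1], [1e3, 4e3, 4.0, 2],
           [1e4, 5e4, 5.0, 1], [1e4, 5e4, 5.0, 3]]
y_train = [[1.3, 0.01], [1.9, 0.04], [1.4, 0.02], [2.6, 0.08]]  # [speedup, inaccuracy]

model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
# Predict for a larger, unseen graph from the small-graph training instances.
print(model.predict([[5e4, 3e5, 6.0, 2]]))
```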
{"title":"Performance and accuracy predictions of approximation methods for shortest-path algorithms on GPUs","authors":"Busenur Aktılav, Işıl Öz","doi":"10.1016/j.parco.2022.102942","DOIUrl":"10.1016/j.parco.2022.102942","url":null,"abstract":"<div><p><span><span>Approximate computing techniques, where less-than-perfect solutions are acceptable, present performance-accuracy trade-offs by performing inexact computations. Moreover, </span>heterogeneous architectures<span><span>, a combination of miscellaneous compute units, offer high performance as well as energy efficiency. Graph algorithms utilize the parallel computation units of heterogeneous </span>GPU architectures as well as performance improvements offered by </span></span>approximation<span> methods. Since different approximations yield different speedup and accuracy loss for the target execution, it becomes impractical to test all methods with various parameters. In this work, we perform approximate computations for the three shortest-path graph algorithms and propose a machine learning framework to predict the impact of the approximations on program performance and output accuracy. We evaluate random predictions for both synthetic and real road-network graphs, and predictions of the large graph cases from small graph instances. We achieve less than 5% prediction error rates for speedup and inaccuracy values.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102942"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89425951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel multi-view HEVC for heterogeneously embedded cluster system
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102948
Seo Jin Jang, Wei Liu, Wei Li, Yong Beom Cho
In this paper, we present a computer cluster with heterogeneous computing components intended to provide concurrency and parallelism with embedded processors, achieving a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous multi-view video coding standard (MVC). However, the MV-HEVC standard also has higher computational complexity. Until now, research using MV-HEVC has had to rely on the Central Processing Unit (CPU) of a Personal Computer (PC) or workstation for decompression, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC) and decompressors need higher parallelism to decompress in real time; encoding/decoding on an embedded device is particularly difficult. Therefore, we propose a novel framework for an MV-HEVC encoder/decoder based on a heterogeneously distributed embedded system. To this end, we use a parallel computing method to divide the video into multiple blocks and then code the blocks independently in each sub-work node, at both the group-of-pictures (GOP) and coding-tree-unit (CTU) levels. To assign tasks appropriately to each work node, we propose a new allocation method that makes the operation of the entire heterogeneously distributed system more efficient. Our experimental results show that, compared to a single device (3D-HTM, single-threaded), the performance of the proposed distributed MV-HEVC decoder and encoder increased approximately 20.39 and 68.7 times, respectively, on 20 devices (multithreaded) at the CTU level for 1088p-resolution video. Further, at the proposed GOP level, decoder and encoder performance on 20 devices (multithreaded) increased approximately 20.78 and 77 times, respectively, for 1088p-resolution video with heterogeneously distributed computing compared to the single device (3D-HTM, single-threaded).
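GOP-level distribution can be sketched as follows: a toy Python illustration in which local processes stand in for the heterogeneous embedded nodes and the encode function is a placeholder, not the real MV-HEVC encoder.

```python
# Toy sketch of GOP-level task distribution; GOP_SIZE and encode_gop are
# illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor

GOP_SIZE = 8  # assumed group-of-pictures length

def encode_gop(gop):
    return f"bitstream[{gop[0]}..{gop[-1]}]"  # placeholder for real encoding

def parallel_encode(num_frames, workers=20):
    gops = [list(range(s, min(s + GOP_SIZE, num_frames)))
            for s in range(0, num_frames, GOP_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_gop, gops))  # GOPs encoded concurrently

if __name__ == "__main__":
    print(parallel_encode(64)[:2])
```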
{"title":"Parallel multi-view HEVC for heterogeneously embedded cluster system","authors":"Seo Jin Jang , Wei Liu , Wei Li , Yong Beom Cho","doi":"10.1016/j.parco.2022.102948","DOIUrl":"10.1016/j.parco.2022.102948","url":null,"abstract":"<div><p><span>In this paper, we present a computer cluster with heterogeneous computing<span> components intended to provide concurrency and parallelism with embedded processors to achieve a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous video coding standard (MVC). However, the MV-HEVC standard also has higher </span></span>computational complexity<span><span>. To this point, research using the MV-HEVC has had to use the Central Processing Unit<span><span> (CPU) on a Personal Computer (PC) or workstation for decompression<span>, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC), and because decompressors need higher parallelism to decompress in real time. It is particularly difficult to encode/decode in an embedded device. Therefore, we propose a novel framework for an MV-HEVC encoder/decoder that is based on a heterogeneously distributed embedded system. To this end, we use a </span></span>parallel computing method to divide the video into multiple blocks and then code the blocks independently in each sub-work node with a group of pictures and a coding tree unit level. To appropriately assign the tasks to each work node, we propose a new allocation method that makes the operation of the entire heterogeneously distributed system more efficient. Our experimental results show that, compared to the single device (3D-HTM single threading), the proposed distributed MV-HEVC decoder and encoder performance increased approximately (20.39 and 68.7) times under 20 devices (multithreading) with the CTU level of a 1088p resolution video, respectively. Further, at the proposed GOP level, the decoder and encoder performance with 20 devices (multithreading) respectively increased approximately (20.78 and 77) times for a 1088p resolution video with heterogeneously </span></span>distributed computing compared to the single device (3D-HTM single threading).</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102948"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74144309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102940
Daniel Bielich, Julien Langou, Stephen Thomas, Kasia Świrydowicz, Ichitaro Yamazaki, Erik G. Boman
The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore to improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices, such as the GMRES and Krylov–Schur iterative methods. In the Arnoldi context, the QR factorization is "left-looking" and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.
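The grouping idea can be sketched in a few lines: all inner products against the current Krylov basis are fused into one block operation, Q^T w, which in a distributed-memory setting costs a single global reduction. This is a simplified illustration of the batching principle only, not the full DCGS2 algorithm, whose delayed reorthogonalization and lagged normalization are more subtle.

```python
# Simplified sketch of the one-reduction idea (not full DCGS2): fuse all dot
# products into one tall-skinny product, so only one global reduction per
# column is needed in a distributed setting (e.g., one MPI_Allreduce).
import numpy as np

def gs_one_reduction(Q, w):
    """Orthogonalize w against the orthonormal columns of Q."""
    h = Q.T @ w               # ONE fused reduction: every dot product at once
    w = w - Q @ h             # purely local update, no communication
    beta = np.linalg.norm(w)  # DCGS2 delays/merges this reduction as well
    return w / beta, h, beta
```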
{"title":"Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers","authors":"Daniel Bielich , Julien Langou , Stephen Thomas , Kasia Świrydowicz , Ichitaro Yamazaki , Erik G. Boman","doi":"10.1016/j.parco.2022.102940","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102940","url":null,"abstract":"<div><p><span>The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore to improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices such as the GMRES and the Krylov–Schur iterative methods. In the Arnoldi context, the </span><span><math><mrow><mi>Q</mi><mi>R</mi></mrow></math></span><span><span><span> factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm, with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur </span>eigenvalue computations. Performance experiments on the ORNL Summit </span>supercomputer then establish the superiority of DCGS2 over CGS2.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102940"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91714605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial- and time-division multiplexing in CNN accelerator
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102922
Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga
With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention for its low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. For such an accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization; specifically, the FPGA device memory controller transparently preloads and caches the CNN models into the FPGA device memory before the data arrive. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival, using time-division multiplexing of the FPGA device memory. In this latter case, the switch cost between CNN models is non-negligible for achieving real-time performance and high device utilization, so the system integrates a new scheduling algorithm that accounts for the switch time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved with an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with their waiting time. The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads, providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional first-come first-served or round-robin algorithms. For predictable workloads, the system improves fairness by 50.5% compared to first-come first-served and achieves 99.5% resource efficiency.
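A minimal sketch of the ageing rule follows, with constants and job fields of my own choosing (the deployed scheduler also models preloading and memory residency in far more detail):

```python
# Minimal sketch of an ageing scheduler (illustrative, not the paper's
# implementation): priority grows with waiting time so no user starves, and
# jobs whose CNN model is not resident in FPGA memory pay a switch cost.
AGEING_RATE = 1.0   # assumed priority gain per second of waiting
SWITCH_COST = 0.5   # assumed penalty when the required model must be loaded

def pick_next(jobs, resident_model, now):
    """jobs: dicts with 'arrival', 'model', 'base_priority'. Highest score wins."""
    def score(job):
        aged = job["base_priority"] + AGEING_RATE * (now - job["arrival"])
        penalty = 0.0 if job["model"] == resident_model else SWITCH_COST
        return aged - penalty
    return max(jobs, key=score)

jobs = [{"arrival": 0.0, "model": "resnet50", "base_priority": 1.0},
        {"arrival": 5.0, "model": "vgg19",    "base_priority": 1.0}]
# The long-waiting resnet50 job wins despite requiring a model switch.
print(pick_next(jobs, resident_model="vgg19", now=6.0)["model"])
```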
{"title":"Spatial- and time- division multiplexing in CNN accelerator","authors":"Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga","doi":"10.1016/j.parco.2022.102922","DOIUrl":"10.1016/j.parco.2022.102922","url":null,"abstract":"<div><p>With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention from the perspectives of their low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. To achieve the accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For the workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization. Specifically, the FPGA device memory controller of the system transparently preloads and caches the CNN models into the FPGA device memory before the data arrival. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival using time-division multiplexing of FPGA device memory. In the latter case of unpredictable workloads, the switch cost between CNN models is non-negligible to achieve real-time performance and high device utilization, so the system integrates a new scheduling algorithm that considers the switch time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved by using an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with the job waiting time. The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional algorithms using first-come first-served or round-robin. For predictable workloads, the system improves fairness by 50.5 % compared to first-come first-served and achieves 99.5 % resource efficiency.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102922"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000254/pdfft?md5=ffbf4f3879d04bf0d1e7a6b33de6606f&pid=1-s2.0-S0167819122000254-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90824046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102920
Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, Ole Schütt, Alfio Lazzaro, Hans Pabst, Stephan Mohr, Jürg Hutter, Thomas D. Kühne, Christian Plessl
We push the boundaries of electronic structure-based ab-initio molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.
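The submatrix idea at the heart of NOLSM can be caricatured in a few lines: the sparsity pattern of each column selects a small dense principal submatrix, the matrix function is applied densely there (the accelerator-friendly step), and the relevant column is kept. This is a rough conceptual sketch under an assumed nonzero diagonal, not the NOLSM implementation.

```python
# Rough conceptual sketch of the submatrix method (simplified, not NOLSM):
# dense work on small submatrices replaces sparse work on the whole matrix.
import numpy as np

def submatrix_apply(S, func):
    """S: symmetric matrix (dense ndarray for clarity); func: matrix function.
    Assumes a nonzero diagonal so column i appears in its own pattern."""
    n = S.shape[0]
    result = np.zeros_like(S)
    for i in range(n):
        idx = np.nonzero(S[:, i])[0]   # sparsity pattern of column i
        sub = S[np.ix_(idx, idx)]      # small *dense* principal submatrix
        fsub = func(sub)               # dense kernel: ideal for GPUs
        col = int(np.where(idx == i)[0][0])
        result[idx, i] = fsub[:, col]  # keep only column i's entries
    return result

# Example: columnwise approximation of the inverse of a banded matrix.
S = np.eye(6) + 0.1 * (np.eye(6, k=1) + np.eye(6, k=-1))
print(np.round(submatrix_apply(S, np.linalg.inv), 3))
```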
{"title":"Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms","authors":"Robert Schade , Tobias Kenter , Hossam Elgabarty , Michael Lass , Ole Schütt , Alfio Lazzaro , Hans Pabst , Stephan Mohr , Jürg Hutter , Thomas D. Kühne , Christian Plessl","doi":"10.1016/j.parco.2022.102920","DOIUrl":"10.1016/j.parco.2022.102920","url":null,"abstract":"<div><p>We push the boundaries of electronic structure-based <em>ab-initio</em> molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102920"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000242/pdfft?md5=cb708fe8c83694714bb33b45ee473a37&pid=1-s2.0-S0167819122000242-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77029045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102893
Jianguo Liang, Rong Hua, Wenqiang Zhu, Yuxi Ye, You Fu, Hao Zhang
The Silicon-Crystal application, based on molecular dynamics (MD), is used to simulate the thermal conductivity of a crystal; it adopts the Tersoff potential to simulate the trajectories of the silicon crystal. Starting from the OpenACC version, task pipeline optimization and an interval-graph-coloring scheduling method are proposed to better address discrete memory access and write dependencies. In addition, part of the code running on the CPEs is vectorized with SIMD instructions to further improve computational performance. After the collaborative OpenACC+Athread development, performance is improved by 16.68 times and achieves a 2.34X speedup compared with the OpenACC version. Moreover, the application scales to 66,560 cores and can simulate 268,435,456 silicon atoms.
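The coloring-based scheduling can be illustrated with a generic greedy conflict-graph coloring (a stand-in for the paper's interval-graph coloring; the data structures are my assumptions): tasks that write to the same location receive different colors, and each color class can then update in parallel without write conflicts.

```python
# Generic greedy conflict-graph coloring sketch (illustrative, not the
# paper's interval-coloring code).
def greedy_coloring(conflicts, n_tasks):
    """conflicts: dict mapping a task to the set of tasks it conflicts with."""
    color = {}
    for t in range(n_tasks):
        used = {color[u] for u in conflicts.get(t, ()) if u in color}
        color[t] = next(c for c in range(n_tasks) if c not in used)
    return color

# Tasks 0 and 1 update the same atom, so they land in different color classes.
print(greedy_coloring({0: {1}, 1: {0}}, n_tasks=3))  # {0: 0, 1: 1, 2: 0}
```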
{"title":"OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight","authors":"Jianguo Liang , Rong Hua , Wenqiang Zhu , Yuxi Ye , You Fu , Hao Zhang","doi":"10.1016/j.parco.2022.102893","DOIUrl":"10.1016/j.parco.2022.102893","url":null,"abstract":"<div><p><span>The Silicon-Crystal application based on molecular dynamics (MD) is used to simulate the thermal conductivity of the crystal, which adopts the Tersoff potential to simulate the trajectory of the silicon crystal. Based on the </span>OpenACC<span><span> version, to better solve the problem of discrete memory access and write dependency, task pipeline optimization and the interval graph coloring scheduling method are proposed. Also, the part of codes on CPEs is vectorized by the SIMD command to further improve the computational performance. After the collaborative development of OpenACC+Athread, the performance has been improved by 16.68 times and achieves 2.34X speedup compared with the OpenACC version. Moreover, the application is expanded to 66,560 cores and can simulate reactions of 268,435,456 </span>silicon atoms.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102893"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75647516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102918
Yassine Ramdane, Omar Boussaid, Doulkifli Boukraà, Nadia Kabachi, Fadila Bentayeb
Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load-balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O for an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system's query optimizer can perform the star join locally, in a single Spark stage without a shuffle phase. The system can also skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct experiments on a cluster of 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.
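The co-location idea behind the shuffle-free star join can be shown with a toy hash partitioner (plain Python for illustration, not the authors' Hadoop/Spark implementation; all names are mine): when the fact table and a dimension are hash-partitioned on the same key, every join match is already node-local.

```python
# Toy sketch of key-based co-partitioning for a local star join.
def partition(rows, key, n_parts):
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

fact = [{"cust_id": i % 4, "amount": i} for i in range(12)]
dim = [{"cust_id": i, "name": f"c{i}"} for i in range(4)]

fact_parts = partition(fact, "cust_id", n_parts=3)
dim_parts = partition(dim, "cust_id", n_parts=3)

# Each partition joins locally; matching cust_ids are guaranteed co-located.
local_join = [(f["amount"], d["name"])
              for fp, dp in zip(fact_parts, dim_parts)
              for f in fp for d in dp if f["cust_id"] == d["cust_id"]]
print(len(local_join))  # 12: every fact row found its dimension row locally
```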
{"title":"Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance","authors":"Yassine Ramdane , Omar Boussaid , Doulkifli Boukraà , Nadia Kabachi , Fadila Bentayeb","doi":"10.1016/j.parco.2022.102918","DOIUrl":"10.1016/j.parco.2022.102918","url":null,"abstract":"<div><p><span>Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of </span>Hadoop<span> is a challenging task. An OLAP Cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well-known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations indeed increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain questionable, such as decreasing the Spark stages and the network I/O for an OLAP query being executed on a distributed system. In a precedent work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query-optimizer can perform a star join process locally, in only one spark stage without a shuffle phase. Also, the system can skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct some experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102918"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90453784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}