
Latest articles from Parallel Computing

Parallel multi-view HEVC for heterogeneously embedded cluster system
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-09-01 DOI: 10.1016/j.parco.2022.102948
Seo Jin Jang , Wei Liu , Wei Li , Yong Beom Cho

In this paper, we present a computer cluster with heterogeneous computing components intended to provide concurrency and parallelism with embedded processors to achieve a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous multi-view video coding standard (MVC), but it also has higher computational complexity. To date, research using MV-HEVC has had to rely on the Central Processing Unit (CPU) of a Personal Computer (PC) or workstation for decompression, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC) and decompressors need higher parallelism to decompress in real time; encoding/decoding on an embedded device is particularly difficult. We therefore propose a novel framework for an MV-HEVC encoder/decoder based on a heterogeneously distributed embedded system. To this end, we use a parallel computing method that divides the video into multiple blocks and then codes the blocks independently on each sub-work node, at both the group-of-pictures (GOP) and coding-tree-unit (CTU) levels. To assign tasks to each work node appropriately, we propose a new allocation method that makes the entire heterogeneously distributed system operate more efficiently. Our experimental results show that, compared to a single device (3D-HTM, single-threaded), the proposed distributed MV-HEVC decoder and encoder running on 20 devices (multithreaded) at the CTU level achieve speedups of approximately 20.39 and 68.7 times, respectively, for 1088p video. Further, at the proposed GOP level, the decoder and encoder on 20 devices (multithreaded) achieve speedups of approximately 20.78 and 77 times, respectively, for 1088p video with heterogeneously distributed computing compared to the single device (3D-HTM, single-threaded).
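The abstract does not spell out the allocation method, but the idea of GOP-level work division on a heterogeneous cluster can be sketched as follows. This is a hypothetical illustration: the greedy earliest-finish assignment, the per-frame cost proxy, and the relative node speed weights are assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of GOP-level work division on a heterogeneous
# cluster: frames are grouped into GOPs, and each GOP goes to the worker
# projected to finish it earliest given its relative speed.

def split_into_gops(num_frames, gop_size):
    """Group frame indices into GOPs of at most `gop_size` frames."""
    return [list(range(i, min(i + gop_size, num_frames)))
            for i in range(0, num_frames, gop_size)]

def allocate(gops, node_speeds):
    """Greedily assign each GOP to the node that would finish it earliest.

    `node_speeds[n]` is a relative throughput weight for worker n, so a
    faster node accumulates load more slowly and receives more GOPs.
    """
    finish = [0.0] * len(node_speeds)            # projected busy time per node
    assignment = {n: [] for n in range(len(node_speeds))}
    for gop in gops:
        cost = len(gop)                          # proxy cost: frames to encode
        node = min(range(len(node_speeds)),
                   key=lambda n: finish[n] + cost / node_speeds[n])
        finish[node] += cost / node_speeds[node]
        assignment[node].append(gop)
    return assignment
```

With two workers where node 0 is twice as fast, node 0 ends up with roughly twice as many GOPs, which is the load-balancing behavior a heterogeneous allocator needs.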

Citations: 0
Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-09-01 DOI: 10.1016/j.parco.2022.102940
Daniel Bielich , Julien Langou , Stephen Thomas , Kasia Świrydowicz , Ichitaro Yamazaki , Erik G. Boman

The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore to improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices, such as the GMRES and Krylov–Schur iterative methods. In the Arnoldi context, the QR factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.

Citations: 9
Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-07-01 DOI: 10.1016/j.parco.2022.102920
Robert Schade , Tobias Kenter , Hossam Elgabarty , Michael Lass , Ole Schütt , Alfio Lazzaro , Hans Pabst , Stephan Mohr , Jürg Hutter , Thomas D. Kühne , Christian Plessl

We push the boundaries of electronic structure-based ab-initio molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural-network and machine-learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low- and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably on massively parallel computing systems and translates large sparse-matrix operations into highly parallel, dense-matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center of each AIMD step, achieves a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision, corresponding to an efficiency of 67.7%, when running on 1536 NVIDIA A100 GPUs.
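The submatrix idea underlying NOLSM can be sketched in a simplified form: a matrix function f of a sparse matrix is approximated column by column by applying f to the small dense principal submatrix induced by each column's nonzero pattern. This turns one large sparse operation into many independent dense kernels that map well to GPUs. The code below is a hedged illustration of that principle, not the paper's implementation (which adds the non-orthogonalized formulation and error compensation).

```python
# Simplified submatrix-method sketch: approximate f(A) column-wise via
# dense principal submatrices over each column's nonzero pattern.
import numpy as np

def submatrix_apply(A, f):
    """Approximate f(A); A is a dense array with sparse structure and a
    nonzero diagonal, f maps a dense matrix to a dense matrix."""
    n = A.shape[0]
    R = np.zeros_like(A)
    for i in range(n):
        idx = np.nonzero(A[:, i])[0]       # nonzero pattern of column i
        sub = A[np.ix_(idx, idx)]          # dense principal submatrix
        fsub = f(sub)                      # small, independent dense kernel
        local = np.where(idx == i)[0][0]   # position of i inside idx
        R[idx, i] = fsub[:, local]         # keep only column i's entries
    return R
```

For a block-diagonal matrix the approximation is exact, which makes for a simple sanity check (here with f = matrix inverse); for general sparse matrices it is an approximation whose error the paper's compensation scheme addresses.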

Citations: 12
Spatial- and time-division multiplexing in CNN accelerator
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-07-01 DOI: 10.1016/j.parco.2022.102922
Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga

With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention from the perspectives of their low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. To achieve the accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For the workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization. Specifically, the FPGA device memory controller of the system transparently preloads and caches the CNN models into the FPGA device memory before the data arrival. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival using time-division multiplexing of FPGA device memory. In the latter case of unpredictable workloads, the switch cost between CNN models is non-negligible to achieve real-time performance and high device utilization, so the system integrates a new scheduling algorithm that considers the switch time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved by using an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with the job waiting time. 
The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads, providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional first-come-first-served or round-robin algorithms. For predictable workloads, the system improves fairness by 50.5% compared to first-come-first-served and achieves 99.5% resource efficiency.
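The ageing idea in the scheduler can be sketched in a few lines: a job's effective priority grows with its waiting time, so long-waiting users are eventually dispatched even when higher-base-priority jobs keep arriving. This is a generic illustration of priority ageing, with invented names and a linear ageing rate; the paper's scheduler additionally accounts for CNN-model switch time, which is not modeled here.

```python
# Minimal sketch of priority ageing: effective priority =
# base priority + ageing_rate * waiting time.
class AgeingScheduler:
    def __init__(self, ageing_rate=1.0):
        self.ageing_rate = ageing_rate
        self.jobs = []                     # (user, base_priority, arrival_time)

    def submit(self, user, base_priority, arrival_time):
        self.jobs.append((user, base_priority, arrival_time))

    def pick(self, now):
        """Dispatch the job with the highest aged priority at time `now`."""
        def aged(job):
            _, prio, arrived = job
            return prio + self.ageing_rate * (now - arrived)
        best = max(self.jobs, key=aged)
        self.jobs.remove(best)
        return best
```

In the example below, the job with base priority 1 has waited long enough that its aged priority (1 + 20) overtakes a fresher job with base priority 5 (5 + 10), preventing starvation.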

Citations: 1
OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-07-01 DOI: 10.1016/j.parco.2022.102893
Jianguo Liang , Rong Hua , Wenqiang Zhu , Yuxi Ye , You Fu , Hao Zhang

The Silicon-Crystal application, based on molecular dynamics (MD), is used to simulate the thermal conductivity of a crystal, adopting the Tersoff potential to model the trajectories of the silicon atoms. Starting from the OpenACC version, to better address the problems of discrete memory access and write dependency, task-pipeline optimization and an interval-graph-coloring scheduling method are proposed. In addition, part of the code on the CPEs is vectorized with SIMD instructions to further improve computational performance. After the collaborative OpenACC+Athread development, the performance has been improved by 16.68 times and achieves a 2.34X speedup compared with the OpenACC version. Moreover, the application is expanded to 66,560 cores and can simulate reactions of 268,435,456 silicon atoms.
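The scheduling idea behind interval-graph coloring can be sketched generically: tasks whose intervals (e.g. the index ranges they write) overlap conflict, and greedy coloring in order of start point, which is optimal for interval graphs, groups non-overlapping tasks under the same color so each color class can run without write conflicts. This is a textbook illustration of the technique named in the abstract, with an assumed interval representation; the paper's scheduler on the Sunway CPEs will differ in its details.

```python
# Greedy interval-graph coloring: process intervals by start point and
# give each the smallest color not held by a still-overlapping interval.
def color_intervals(intervals):
    """intervals: list of (start, end) pairs; returns a color per interval."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    colors = [None] * len(intervals)
    active = []                                  # (end, color) already placed
    for i in order:
        start, end = intervals[i]
        used = {c for e, c in active if e > start}   # colors still occupied
        c = 0
        while c in used:
            c += 1
        colors[i] = c
        active.append((end, c))
    return colors
```

Tasks sharing a color never overlap, so they can be dispatched in the same conflict-free batch; the number of colors equals the maximum overlap depth.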

Citations: 1
Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-07-01 DOI: 10.1016/j.parco.2022.102918
Yassine Ramdane , Omar Boussaid , Doulkifli Boukraà , Nadia Kabachi , Fadila Bentayeb

Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. Star join and group-by aggregation are well known to be the most costly operations in a Hadoop database system: they increase network traffic and may overflow memory. To overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O of an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, so that the system's query optimizer can perform the star join locally, in a single Spark stage without a shuffle phase. The system can also skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.
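The reason a star join can run without a shuffle phase is co-partitioning: if the fact table and each dimension table are hash-partitioned on the same join key at load time, matching rows land on the same node and every partition pair joins entirely in local memory. The plain-Python sketch below illustrates that idea only; it is not the paper's placement strategy, and the table/key names are invented.

```python
# Hypothetical sketch of co-partitioned, shuffle-free star join:
# both tables are hashed on the join key, then each partition pair
# is joined locally with an in-memory hash index on the dimension side.

def hash_partition(rows, key, num_parts):
    """Route each row (a dict) to a partition by hashing its join key."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[hash(row[key]) % num_parts].append(row)
    return parts

def local_star_join(fact_part, dim_part, key):
    """Join one co-located partition pair entirely in node memory."""
    dim_index = {d[key]: d for d in dim_part}
    return [{**f, **dim_index[f[key]]} for f in fact_part
            if f[key] in dim_index]
```

Because `hash_partition` uses the same hash and partition count for both tables, no fact row ever needs a dimension row from another partition, which is exactly the property that lets the query plan stay in one stage.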

Citations: 5
Towards scaling community detection on distributed-memory heterogeneous systems
IF 1.4 CAS Tier 4 Computer Science Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2022-07-01 DOI: 10.1016/j.parco.2022.102898
Nitin Gawande , Sayan Ghosh , Mahantesh Halappanavar , Antonino Tumeo , Ananth Kalyanaraman

In most real-world networks, nodes/vertices tend to be organized into tightly-knit modules known as communities or clusters such that nodes within a community are more likely to be connected or related to one another than they are to the rest of the network. Community detection in a network (graph) is aimed at finding a partitioning of the vertices into communities. The goodness of the partitioning is commonly measured using modularity. Maximizing modularity is an NP-complete problem. In 2008, Blondel et al. introduced a multi-phase, multi-iteration heuristic for modularity maximization called the Louvain method. Owing to its speed and ability to yield high quality communities, the Louvain method continues to be one of the most widely used tools for serial community detection.

Distributed multi-GPU systems pose significant challenges and opportunities for efficient execution of parallel applications. Graph algorithms, in particular, have been known to be harder to parallelize on such platforms, due to irregular memory accesses, low computation to communication ratios, and load balancing problems that are especially hard to address on multi-GPU systems.

In this paper, we present our ongoing work on distributed-memory implementation of Louvain method on heterogeneous systems. We build on our prior work parallelizing the Louvain method for community detection on traditional CPU-only distributed systems without GPUs. Corroborated by an extensive set of experiments on multi-GPU systems, we demonstrate competitive performance to existing distributed-memory CPU-based implementation, up to 3.2× speedup using 16 nodes of OLCF Summit relative to two nodes, and up to 19× speedup relative to the NVIDIA RAPIDS® cuGraph® implementation on a single NVIDIA V100 GPU from DGX-2 platform, while achieving high quality solutions comparable to the original Louvain method. To the best of our knowledge, this work represents the first effort for community detection on distributed multi-GPU systems. Our approach and related findings can be extended to numerous other iterative graph algorithms on multi-GPU systems.
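The modularity score the Louvain method maximizes is Q = (1/2m) Σᵢⱼ [Aᵢⱼ − kᵢkⱼ/(2m)] δ(cᵢ, cⱼ): the fraction of edges inside communities minus the fraction expected under a degree-preserving random model. A plain-Python illustration for an undirected, unweighted graph (not the paper's distributed implementation):

```python
# Modularity of a partition of an undirected, unweighted graph.
def modularity(edges, community):
    """edges: list of (u, v) pairs; community: dict mapping node -> label."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # observed fraction of edges that stay inside a community
    intra = sum(1 for u, v in edges if community[u] == community[v]) / m
    # expected intra-community fraction under the configuration model
    expected = sum(
        (sum(deg[n] for n in community if community[n] == c) / (2 * m)) ** 2
        for c in set(community.values())
    )
    return intra - expected
```

For two disjoint triangles placed in separate communities, every edge is internal and the score comes out to 0.5, a standard sanity check; Louvain repeatedly moves vertices between communities to increase this quantity.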

在大多数现实世界的网络中,节点/顶点往往被组织成紧密结合的模块,称为社区或集群,这样社区内的节点更有可能相互连接或相互关联,而不是与网络的其余部分相连。网络(图)中的社区检测旨在找到将顶点划分为社区的方法。划分的好坏通常用模块化来衡量。模块化最大化是一个np完全问题。2008年,Blondel等人引入了一种多阶段、多迭代的模块化最大化启发式方法,称为Louvain方法。由于它的速度和产生高质量社区的能力,Louvain方法仍然是串行社区检测最广泛使用的工具之一。分布式多gpu系统为高效执行并行应用程序带来了巨大的挑战和机遇。特别是图算法,在这样的平台上很难并行化,这是由于不规则的内存访问,较低的计算与通信比率,以及在多gpu系统上特别难以解决的负载平衡问题。在本文中,我们介绍了我们正在进行的Louvain方法在异构系统上的分布式内存实现的工作。我们在之前的工作的基础上,将Louvain方法并行化,用于在没有gpu的传统仅cpu分布式系统上进行社区检测。通过在多GPU系统上进行的大量实验证实,我们展示了与现有基于分布式内存cpu的实现相比具有竞争力的性能,使用OLCF Summit的16个节点相对于两个节点的加速高达3.2倍,相对于来自DGX-2平台的单个NVIDIA V100 GPU的NVIDIA RAPIDS®cuGraph®实现的加速高达19倍,同时获得与原始Louvain方法相当的高质量解决方案。据我们所知,这项工作代表了分布式多gpu系统上社区检测的第一次努力。我们的方法和相关发现可以扩展到多gpu系统上的许多其他迭代图算法。
Nitin Gawande, Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, "Towards scaling community detection on distributed-memory heterogeneous systems", Parallel Computing, vol. 111, Article 102898, 2022. DOI: 10.1016/j.parco.2022.102898
Citations: 2
Task-parallel tiled direct solver for dense symmetric indefinite systems
IF 1.4 · CAS Tier 4 (Computer Science) · Q2 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2022-07-01 · DOI: 10.1016/j.parco.2022.102900
Zhongyu Shen , Jilin Zhang , Tomohiro Suzuki

This paper proposes a direct solver for symmetric indefinite linear systems. The program is parallelized via the OpenMP task construct and outperforms existing programs. The proposed solver avoids pivoting, which requires substantial data movement, by preconditioning the factorization with the symmetric random butterfly transformation. The matrix data layout is tiled after the preconditioning to use cache memory more efficiently during factorization. Given the low-rank property of the input matrices, an adaptive cross approximation is used to build a low-rank approximation before the update step, reducing the computational load. Iterative refinement is then applied to improve the accuracy of the final result. Finally, the performance of the proposed solver is compared to that of various symmetric indefinite linear system solvers to show its superiority.
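The iterative-refinement step mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's tiled OpenMP implementation: the perturbed matrix M stands in for an inexact (e.g. low-precision or pivot-free) factorization of A.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def iterative_refinement(A, b, approx_solve, iters=10):
    """Classic iterative refinement: start from an inexact solve, then
    repeatedly solve A*d = r for the residual r = b - A*x and correct x."""
    x = approx_solve(b)
    for _ in range(iters):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        d = approx_solve(r)
        x = [xi + di for xi, di in zip(x, d)]
    return x

# A small symmetric indefinite system; M is a deliberately perturbed inverse
# of A (exact inverse is [[2/9, 1/9], [1/9, -4/9]]), standing in for an
# inexact factorization. Refinement converges because ||I - M*A|| < 1.
A = [[4.0, 1.0], [1.0, -2.0]]
M = [[0.22, 0.11], [0.11, -0.44]]
b = [1.0, 5.0]
x = iterative_refinement(A, b, lambda v: matvec(M, v))
print([round(v, 6) for v in x])  # → [0.777778, -2.111111], i.e. (7/9, -19/9)
```

Each pass multiplies the error by roughly ||I − M·A||, so even a fairly crude factorization recovers full working precision after a handful of residual corrections.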

Zhongyu Shen, Jilin Zhang, Tomohiro Suzuki, "Task-parallel tiled direct solver for dense symmetric indefinite systems", Parallel Computing, vol. 111, Article 102900, 2022. DOI: 10.1016/j.parco.2022.102900
Citations: 0
A coarse-grained multicomputer parallel algorithm for the sequential substring constrained longest common subsequence problem
IF 1.4 · CAS Tier 4 (Computer Science) · Q2 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2022-07-01 · DOI: 10.1016/j.parco.2022.102927
Vianney Kengne Tchendji , Hermann Bogning Tepiele , Mathias Akong Onabid , Jean Frédéric Myoupo , Jerry Lacmou Zeutouo

In this paper, we study the sequential substring constrained longest common subsequence (SSCLCS) problem, which is widely used in the bioinformatics field. Given two strings X and Y with respective lengths m and n, formed on an alphabet Σ, and a constraint sequence C formed by ordered strings (c^1, c^2, …, c^l) with total length r, the SSCLCS problem is to find the longest common subsequence D between X and Y such that D contains c^1, c^2, …, c^l in an ordered way. To solve this problem, Tseng et al. proposed a dynamic-programming algorithm that runs in O(mnr + (m+n)|Σ|) time. We rely on this work to propose a parallel algorithm for the SSCLCS problem on the Coarse-Grained Multicomputer (CGM) model. We design a three-dimensional partitioning technique of the corresponding dependency graph to reduce the latency time of processors by ensuring that at each step, the size of the subproblems to be performed by processors is small. It also minimizes the number of communications between processors. Our solution requires O((nmr + (m+n)|Σ|)/p) execution time with O(p) communication rounds on p processors. The experimental results show that our solution achieves speedups of up to 59.7 on 64 processors, which is better than the CGM-based parallel techniques that have been used to solve similar problems.
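For readers unfamiliar with the underlying recurrence, here is the classic (unconstrained) LCS dynamic program that the SSCLCS formulation extends; the paper's algorithm adds dimensions tracking progress through the constraint strings and partitions the resulting dependency graph across processors.

```python
def lcs(x, y):
    """Classic O(mn) longest-common-subsequence dynamic program. The SSCLCS
    recurrence extends this table with extra state that tracks progress
    through the ordered constraint strings c^1, ..., c^l."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one optimal subsequence.
    out, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("ABCBDAB", "BDCABA"))  # one longest common subsequence, length 4
```

The anti-diagonal dependency structure of this table (cell (i, j) needs only (i-1, j), (i, j-1), and (i-1, j-1)) is what makes block-level parallelization on the CGM model possible.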

Vianney Kengne Tchendji, Hermann Bogning Tepiele, Mathias Akong Onabid, Jean Frédéric Myoupo, Jerry Lacmou Zeutouo, "A coarse-grained multicomputer parallel algorithm for the sequential substring constrained longest common subsequence problem", Parallel Computing, vol. 111, Article 102927, 2022. DOI: 10.1016/j.parco.2022.102927
Citations: 0
Linear solvers for power grid optimization problems: A review of GPU-accelerated linear solvers
IF 1.4 · CAS Tier 4 (Computer Science) · Q2 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2022-07-01 · DOI: 10.1016/j.parco.2021.102870
Kasia Świrydowicz , Eric Darve , Wesley Jones , Jonathan Maack , Shaked Regev , Michael A. Saunders , Stephen J. Thomas , Slaven Peleš

The linear equations that arise in interior methods for constrained optimization are sparse symmetric indefinite, and they become extremely ill-conditioned as the interior method converges. These linear systems present a challenge for existing solver frameworks based on sparse LU or LDL^T decompositions. We benchmark five well known direct linear solver packages on CPU- and GPU-based hardware, using matrices extracted from power grid optimization problems. The achieved solution accuracy varies greatly among the packages. None of the tested packages delivers significant GPU acceleration for our test cases. For completeness of the comparison we include results for MA57, which is one of the most efficient and reliable CPU solvers for this class of problem.
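The ill-conditioning claim can be illustrated with a toy calculation (not from the paper): in a log-barrier method, the Hessian contribution is diagonal with entries μ/x_i², and as μ → 0 the variables that approach their bounds shrink like x_i ≈ μ while the rest stay O(1), so the conditioning degrades like 1/μ².

```python
def cond_diag(d):
    """Condition number of a diagonal matrix: ratio of extreme |entries|."""
    a = [abs(v) for v in d]
    return max(a) / min(a)

# Toy barrier Hessian for min c.x - mu * sum(log x_i): diagonal entries mu / x_i^2.
# One variable heads to its bound (x ~ mu), one stays interior (x ~ 1), so the
# spread between the diagonal entries grows like 1/mu^2 as mu -> 0.
for mu in (1e-1, 1e-3, 1e-6):
    x = [mu, 1.0]
    H = [mu / xi**2 for xi in x]
    print(f"mu={mu:g}  cond(H)={cond_diag(H):.1e}")
```

This is exactly the regime where pivoting strategies in sparse LDL^T factorizations are stressed, which is why the paper benchmarks solver accuracy as well as speed.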

Kasia Świrydowicz, Eric Darve, Wesley Jones, Jonathan Maack, Shaked Regev, Michael A. Saunders, Stephen J. Thomas, Slaven Peleš, "Linear solvers for power grid optimization problems: A review of GPU-accelerated linear solvers", Parallel Computing, vol. 111, Article 102870, 2022. DOI: 10.1016/j.parco.2021.102870
Citations: 19