Resource allocation for task-level speculative scientific applications: A proof of concept using Parallel Trajectory Splicing
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102936
Andrew Garmon, Vinay Ramakrishnaiah, Danny Perez
The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy for improving scalability is to express algorithms in terms of independent tasks that can be executed concurrently on a runtime system. In this manuscript, we consider a generalization of this approach in which task-level speculation is allowed. In this context, a probability is attached to each task, corresponding to the likelihood that the output of the speculative task will be consumed as part of the larger calculation. We consider the problem of optimally allocating resources to each of the possible tasks so as to maximize the total expected computational throughput. The power of this approach is demonstrated by analyzing its application to Parallel Trajectory Splicing, a massively parallel long-time-dynamics method for atomistic simulations.
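To make the allocation problem concrete, here is a minimal Python sketch, assuming a concave Amdahl-style speedup model and a simple greedy scheme; the model, its parameters, and all function names are illustrative assumptions of mine, not the paper's actual formulation. With a concave per-task speedup, repeatedly granting the next core to the task with the largest marginal expected gain maximizes the total expected throughput.

```python
# Hypothetical sketch: greedily assign cores to speculative tasks to maximize
# total *expected* throughput. Assumes a concave per-task speedup model, for
# which this greedy incremental allocation is optimal.
import heapq

def speedup(cores, frac_parallel=0.95):
    # Amdahl-style speedup; frac_parallel is an assumed model parameter.
    if cores == 0:
        return 0.0
    return 1.0 / ((1.0 - frac_parallel) + frac_parallel / cores)

def allocate(probabilities, total_cores):
    """probabilities[i] = likelihood that task i's output is consumed."""
    alloc = [0] * len(probabilities)
    # Max-heap keyed on the marginal expected gain of granting one more core.
    heap = [(-p * speedup(1), i) for i, p in enumerate(probabilities)]
    heapq.heapify(heap)
    for _ in range(total_cores):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        gain = probabilities[i] * (speedup(alloc[i] + 1) - speedup(alloc[i]))
        heapq.heappush(heap, (-gain, i))
    return alloc

# Likely-to-be-consumed tasks receive more cores than long-shot speculation.
print(allocate([0.9, 0.5, 0.1], total_cores=16))
```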
{"title":"Resource allocation for task-level speculative scientific applications: A proof of concept using Parallel Trajectory Splicing","authors":"Andrew Garmon , Vinay Ramakrishnaiah , Danny Perez","doi":"10.1016/j.parco.2022.102936","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102936","url":null,"abstract":"<div><p><span>The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy to improve scalability is to express algorithms in terms of independent tasks that can be executed concurrently on a </span>runtime system<span>. In this manuscript, we consider a generalization of this approach where task-level speculation is allowed. In this context, a probability is attached to each task which corresponds to the likelihood that the output of the speculative task will be consumed as part of the larger calculation. We consider the problem of optimal resource allocation to each of the possible tasks so as to maximize the total expected computational throughput. The power of this approach is demonstrated by analyzing its application to Parallel Trajectory Splicing, a massively-parallel long-time-dynamics method for atomistic simulations.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102936"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91714606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102944
Lena Oden, Jörg Keller
We investigate cryptanalytic applications composed of many independent tasks whose runtimes follow a stochastic distribution. We compare four algorithms for executing such applications on GPUs and on multicore CPUs with SIMD units. We demonstrate that the best strategy varies across four different distributions, multiple problem sizes, and three platforms. We support our analytic results with extensive experiments on an Intel Skylake-based multicore CPU and a high-performance GPU (NVIDIA Volta).
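The core tension can be illustrated with a small Monte Carlo experiment (my own sketch, not the authors' code): on SIMD or GPU hardware, a group of lanes executes in lockstep, so a batch of tasks costs the maximum of its runtimes, and long-tailed runtime distributions erode efficiency as the group widens. The exponential distribution below is an assumed example.

```python
# Illustrative sketch: estimate SIMD efficiency when `width` tasks with
# random runtimes run in lockstep, so the group waits for the slowest lane.
import random

def simd_efficiency(width, n_batches=10000):
    useful, occupied = 0.0, 0.0
    for _ in range(n_batches):
        runtimes = [random.expovariate(1.0) for _ in range(width)]
        useful += sum(runtimes)            # work actually needed
        occupied += width * max(runtimes)  # lanes held until the slowest finishes
    return useful / occupied

for w in (1, 4, 32):
    print(f"SIMD width {w:2d}: efficiency ~ {simd_efficiency(w):.2f}")
```

For exponentially distributed runtimes the efficiency is 1/H_w (the w-th harmonic number), so wider lockstep groups pay a growing straggler penalty; this is one reason the best execution strategy varies with the distribution.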
{"title":"Improving cryptanalytic applications with stochastic runtimes on GPUs and multicores","authors":"Lena Oden, Jörg Keller","doi":"10.1016/j.parco.2022.102944","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102944","url":null,"abstract":"<div><p>We investigate cryptanalytic applications comprised of many independent tasks that exhibit a stochastic runtime distribution. We compare four algorithms for executing such applications on GPUs and on multicore CPUs with SIMD units. We demonstrate that for four different distributions, multiple problem sizes, and three platforms the best strategy varies. We support our analytic results by extensive experiments on an Intel Skylake-based multicore CPU and a high performance GPU (Nvidia Volta).</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102944"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"137214689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing convolutional neural networks on multi-core vector accelerator
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102945
Zhong Liu, Xin Xiao, Chen Li, Sheng Ma, Deng Rangyu
Vector accelerators have been widely used in scientific computing, and they also show great potential for accelerating the computational performance of convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduce a large amount of intermediate data and additional conversions, and the resulting memory overhead causes significant performance loss.
To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, comprising: 1) a data layout method that establishes a set of efficient data storage and computing models for various CNN networks on vector accelerators, achieving high memory-access efficiency and high vectorization efficiency; and 2) a conversion method that transforms the computation of convolutional and fully connected layers into large-scale matrix multiplication, and the computation of pooling layers into row-wise matrix computations. All conversions are implemented by extracting rows from a two-dimensional matrix, with high data access and transmission efficiency, and without additional memory overhead or data conversion.
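The conversion of a convolutional layer into one large matrix multiplication follows the general unfolding idea sketched below (a generic im2row-style illustration, not the authors' exact data layout; all names are mine):

```python
# Minimal im2row-style sketch (generic unfolding technique, not the authors'
# exact layout): a convolution becomes one large matrix multiplication by
# extracting input patches as matrix rows.
import numpy as np

def conv2d_as_matmul(x, w):
    """x: (H, W, Cin) input; w: (K, K, Cin, Cout) filters; stride 1, no padding."""
    H, W, Cin = x.shape
    K, _, _, Cout = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # Unfold each KxKxCin patch into one row -> a (Ho*Wo, K*K*Cin) matrix.
    rows = np.stack([x[i:i + K, j:j + K, :].ravel()
                     for i in range(Ho) for j in range(Wo)])
    out = rows @ w.reshape(K * K * Cin, Cout)  # the single large matmul
    return out.reshape(Ho, Wo, Cout)

x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 16)
print(conv2d_as_matmul(x, w).shape)  # (6, 6, 16)
```

Pooling layers can likewise be phrased as row-wise reductions over such an unfolded matrix.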
Based on these methods, we design a vectorization mechanism to vectorize the convolutional, pooling, and fully connected layers on a vector accelerator, applicable to various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. The experimental results show that the average computational efficiency of the convolutional and fully connected layers of AlexNet, VGG-19, GoogLeNet, and ResNet-50 is 93.3% and 93.4%, respectively, and the average data-access efficiency of the pooling layers is 70%. Our accelerator achieves a 36.1% performance improvement over NVIDIA inference GPUs and is comparable to the NVIDIA V100 GPU. Compared with the Matrix2000, which has a similar architecture, our accelerator achieves a 17%–45% improvement in computational efficiency.
{"title":"Optimizing convolutional neural networks on multi-core vector accelerator","authors":"Zhong Liu, Xin Xiao, Chen Li, Sheng Ma, Deng Rangyu","doi":"10.1016/j.parco.2022.102945","DOIUrl":"10.1016/j.parco.2022.102945","url":null,"abstract":"<div><p>Vector Accelerators have been widely used in scientific computing. It also shows great potential to accelerate the computational performance of convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduced a large amount of intermediate data and additional conversion, and the resulting memory overhead would cause great performance loss.</p><p>To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, including: 1) Data layout method: establishing a set of efficient data storage and computing models for various CNN networks on vector accelerators. It achieves high memory access efficiency and high vectorization efficiency. 2) A conversion method: convert the computation of convolutional layers and fully connected layers into large-scale matrix multiplication, and convert the computation of pooling layers into row computation of matrix. All conversions are implemented by extracting rows from a two-dimensional matrix, with high data access and transmission efficiency, and without additional memory overhead and data conversion.</p><p>Based on these methods, we design a vectorization mechanism to vectorize convolutional, pooling and fully connected layers on a vector accelerator, which can be applied for various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. The experimental results show that the average computational efficiency of the convolutional layers and full connected layers of AlexNet, VGG-19, GoogleNet and ResNet-50 is 93.3% and 93.4% respectively, and the average data access efficiency of pooling layer is 70%. Compared to NVIDIA inference GPUs, our accelerator achieves a 36.1% performance improvement, comparable to NVIDIA V100 GPUs. Compared with Matrix2000 of similar architecture, our accelerator achieves a 17-45% improvement in computational efficiency.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102945"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83744906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance and accuracy predictions of approximation methods for shortest-path algorithms on GPUs
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102942
Busenur Aktılav, Işıl Öz
Approximate computing techniques, in which less-than-perfect solutions are acceptable, offer performance-accuracy trade-offs by performing inexact computations. Moreover, heterogeneous architectures, which combine miscellaneous compute units, offer high performance as well as energy efficiency. Graph algorithms benefit from the parallel compute units of heterogeneous GPU architectures as well as from the performance improvements offered by approximation methods. Since different approximations yield different speedups and accuracy losses for the target execution, it becomes impractical to test all methods with various parameters. In this work, we perform approximate computations for three shortest-path graph algorithms and propose a machine learning framework to predict the impact of the approximations on program performance and output accuracy. We evaluate predictions for both random synthetic graphs and real road-network graphs, as well as predictions for large-graph cases made from small graph instances. We achieve prediction error rates below 5% for both speedup and inaccuracy values.
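The prediction step can be pictured as a small multi-output regression, sketched below under assumed features, training data, and model choice (the authors' actual feature set and learner are not reproduced here):

```python
# Hedged sketch of feature-based prediction; features, data, and the model
# are my assumptions, not the paper's framework.
from sklearn.ensemble import RandomForestRegressor

# Hypothetical features: [num_nodes, num_edges, avg_degree, approx_level]
X_train = [[1e3, 4e3, 4.0, 1], [1e3, 4e3, 4.0, 2],
           [1e4, 5e4, 5.0, 1], [1e4, 5e4, 5.0, 3]]
y_train = [[1.3, 0.01], [1.9, 0.04], [1.4, 0.02], [2.6, 0.08]]  # [speedup, inaccuracy]

model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
# Predict for a larger, unseen graph from the small-graph training instances.
print(model.predict([[5e4, 3e5, 6.0, 2]]))
```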
{"title":"Performance and accuracy predictions of approximation methods for shortest-path algorithms on GPUs","authors":"Busenur Aktılav, Işıl Öz","doi":"10.1016/j.parco.2022.102942","DOIUrl":"10.1016/j.parco.2022.102942","url":null,"abstract":"<div><p><span><span>Approximate computing techniques, where less-than-perfect solutions are acceptable, present performance-accuracy trade-offs by performing inexact computations. Moreover, </span>heterogeneous architectures<span><span>, a combination of miscellaneous compute units, offer high performance as well as energy efficiency. Graph algorithms utilize the parallel computation units of heterogeneous </span>GPU architectures as well as performance improvements offered by </span></span>approximation<span> methods. Since different approximations yield different speedup and accuracy loss for the target execution, it becomes impractical to test all methods with various parameters. In this work, we perform approximate computations for the three shortest-path graph algorithms and propose a machine learning framework to predict the impact of the approximations on program performance and output accuracy. We evaluate random predictions for both synthetic and real road-network graphs, and predictions of the large graph cases from small graph instances. We achieve less than 5% prediction error rates for speedup and inaccuracy values.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102942"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89425951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel multi-view HEVC for heterogeneously embedded cluster system
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102948
Seo Jin Jang, Wei Liu, Wei Li, Yong Beom Cho
In this paper, we present a computer cluster with heterogeneous computing components intended to provide concurrency and parallelism with embedded processors, achieving a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous multi-view video coding standard (MVC). However, the MV-HEVC standard also has higher computational complexity. Until now, research using MV-HEVC has had to rely on the Central Processing Unit (CPU) of a Personal Computer (PC) or workstation for decompression, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC) and decompressors need higher parallelism to decompress in real time; encoding/decoding on an embedded device is particularly difficult. Therefore, we propose a novel framework for an MV-HEVC encoder/decoder based on a heterogeneously distributed embedded system. To this end, we use a parallel computing method to divide the video into multiple blocks and then code the blocks independently in each sub-work node, at both the group-of-pictures (GOP) and coding-tree-unit (CTU) levels. To assign tasks appropriately to each work node, we propose a new allocation method that makes the operation of the entire heterogeneously distributed system more efficient. Our experimental results show that, compared to a single device (3D-HTM, single-threaded), the performance of the proposed distributed MV-HEVC decoder and encoder increased approximately 20.39 and 68.7 times, respectively, on 20 devices (multithreaded) at the CTU level for 1088p-resolution video. Further, at the proposed GOP level, decoder and encoder performance on 20 devices (multithreaded) increased approximately 20.78 and 77 times, respectively, for 1088p-resolution video with heterogeneously distributed computing compared to the single device (3D-HTM, single-threaded).
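GOP-level distribution can be sketched as follows: a toy Python illustration in which local processes stand in for the heterogeneous embedded nodes and the encode function is a placeholder, not the real MV-HEVC encoder.

```python
# Toy sketch of GOP-level task distribution; GOP_SIZE and encode_gop are
# illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor

GOP_SIZE = 8  # assumed group-of-pictures length

def encode_gop(gop):
    return f"bitstream[{gop[0]}..{gop[-1]}]"  # placeholder for real encoding

def parallel_encode(num_frames, workers=20):
    gops = [list(range(s, min(s + GOP_SIZE, num_frames)))
            for s in range(0, num_frames, GOP_SIZE)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_gop, gops))  # GOPs encoded concurrently

if __name__ == "__main__":
    print(parallel_encode(64)[:2])
```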
{"title":"Parallel multi-view HEVC for heterogeneously embedded cluster system","authors":"Seo Jin Jang , Wei Liu , Wei Li , Yong Beom Cho","doi":"10.1016/j.parco.2022.102948","DOIUrl":"10.1016/j.parco.2022.102948","url":null,"abstract":"<div><p><span>In this paper, we present a computer cluster with heterogeneous computing<span> components intended to provide concurrency and parallelism with embedded processors to achieve a real-time Multi-View High-Efficiency Video Coding (MV-HEVC) encoder/decoder with a maximum resolution of 1088p. The latest MV-HEVC standard represents a significant improvement over the previous video coding standard (MVC). However, the MV-HEVC standard also has higher </span></span>computational complexity<span><span>. To this point, research using the MV-HEVC has had to use the Central Processing Unit<span><span> (CPU) on a Personal Computer (PC) or workstation for decompression<span>, because MV-HEVC is much more complex than High-Efficiency Video Coding (HEVC), and because decompressors need higher parallelism to decompress in real time. It is particularly difficult to encode/decode in an embedded device. Therefore, we propose a novel framework for an MV-HEVC encoder/decoder that is based on a heterogeneously distributed embedded system. To this end, we use a </span></span>parallel computing method to divide the video into multiple blocks and then code the blocks independently in each sub-work node with a group of pictures and a coding tree unit level. To appropriately assign the tasks to each work node, we propose a new allocation method that makes the operation of the entire heterogeneously distributed system more efficient. Our experimental results show that, compared to the single device (3D-HTM single threading), the proposed distributed MV-HEVC decoder and encoder performance increased approximately (20.39 and 68.7) times under 20 devices (multithreading) with the CTU level of a 1088p resolution video, respectively. Further, at the proposed GOP level, the decoder and encoder performance with 20 devices (multithreading) respectively increased approximately (20.78 and 77) times for a 1088p resolution video with heterogeneously </span></span>distributed computing compared to the single device (3D-HTM single threading).</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102948"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74144309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers
Pub Date: 2022-09-01 | DOI: 10.1016/j.parco.2022.102940
Daniel Bielich, Julien Langou, Stephen Thomas, Kasia Świrydowicz, Ichitaro Yamazaki, Erik G. Boman
The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore to improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices, such as the GMRES and Krylov–Schur iterative methods. In the Arnoldi context, the QR factorization is "left-looking" and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.
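The grouping idea can be sketched in a few lines: all inner products against the current Krylov basis are fused into one block operation, Q^T w, which in a distributed-memory setting costs a single global reduction. This is a simplified illustration of the batching principle only, not the full DCGS2 algorithm, whose delayed reorthogonalization and lagged normalization are more subtle.

```python
# Simplified sketch of the one-reduction idea (not full DCGS2): fuse all dot
# products into one tall-skinny product, so only one global reduction per
# column is needed in a distributed setting (e.g., one MPI_Allreduce).
import numpy as np

def gs_one_reduction(Q, w):
    """Orthogonalize w against the orthonormal columns of Q."""
    h = Q.T @ w               # ONE fused reduction: every dot product at once
    w = w - Q @ h             # purely local update, no communication
    beta = np.linalg.norm(w)  # DCGS2 delays/merges this reduction as well
    return w / beta, h, beta
```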
{"title":"Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers","authors":"Daniel Bielich , Julien Langou , Stephen Thomas , Kasia Świrydowicz , Ichitaro Yamazaki , Erik G. Boman","doi":"10.1016/j.parco.2022.102940","DOIUrl":"https://doi.org/10.1016/j.parco.2022.102940","url":null,"abstract":"<div><p><span>The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore to improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices such as the GMRES and the Krylov–Schur iterative methods. In the Arnoldi context, the </span><span><math><mrow><mi>Q</mi><mi>R</mi></mrow></math></span><span><span><span> factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm, with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur </span>eigenvalue computations. Performance experiments on the ORNL Summit </span>supercomputer then establish the superiority of DCGS2 over CGS2.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"112 ","pages":"Article 102940"},"PeriodicalIF":1.4,"publicationDate":"2022-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91714605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial- and time-division multiplexing in CNN accelerator
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102922
Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga
With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention for its low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. For such an accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization; specifically, the FPGA device memory controller transparently preloads and caches the CNN models into the FPGA device memory before the data arrive. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival, using time-division multiplexing of the FPGA device memory. In this latter case, the switch cost between CNN models is non-negligible for achieving real-time performance and high device utilization, so the system integrates a new scheduling algorithm that accounts for the switch time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved with an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with their waiting time. The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads, providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional first-come first-served or round-robin algorithms. For predictable workloads, the system improves fairness by 50.5% compared to first-come first-served and achieves 99.5% resource efficiency.
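A minimal sketch of the ageing rule follows, with constants and job fields of my own choosing (the deployed scheduler also models preloading and memory residency in far more detail):

```python
# Minimal sketch of an ageing scheduler (illustrative, not the paper's
# implementation): priority grows with waiting time so no user starves, and
# jobs whose CNN model is not resident in FPGA memory pay a switch cost.
AGEING_RATE = 1.0   # assumed priority gain per second of waiting
SWITCH_COST = 0.5   # assumed penalty when the required model must be loaded

def pick_next(jobs, resident_model, now):
    """jobs: dicts with 'arrival', 'model', 'base_priority'. Highest score wins."""
    def score(job):
        aged = job["base_priority"] + AGEING_RATE * (now - job["arrival"])
        penalty = 0.0 if job["model"] == resident_model else SWITCH_COST
        return aged - penalty
    return max(jobs, key=score)

jobs = [{"arrival": 0.0, "model": "resnet50", "base_priority": 1.0},
        {"arrival": 5.0, "model": "vgg19",    "base_priority": 1.0}]
# The long-waiting resnet50 job wins despite requiring a model switch.
print(pick_next(jobs, resident_model="vgg19", now=6.0)["model"])
```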
{"title":"Spatial- and time- division multiplexing in CNN accelerator","authors":"Tetsuro Nakamura, Shogo Saito, Kei Fujimoto, Masashi Kaneko, Akinori Shiraga","doi":"10.1016/j.parco.2022.102922","DOIUrl":"10.1016/j.parco.2022.102922","url":null,"abstract":"<div><p>With the widespread use of real-time data analysis by artificial intelligence (AI), the integration of accelerators is attracting attention from the perspectives of their low power consumption and low latency. The objective of this research is to increase accelerator resource efficiency and further reduce power consumption by sharing accelerators among multiple users while maintaining real-time performance. To achieve the accelerator-sharing system, we define three requirements: high device utilization, fair device utilization among users, and real-time performance. Targeting the AI inference use case, this paper proposes a system that shares a field-programmable gate array (FPGA) among multiple users by switching the convolutional neural network (CNN) models stored in the device memory on the FPGA, while satisfying the three requirements. The proposed system uses different behavioral models for workloads with predictable and unpredictable data arrival timing. For the workloads with predictable data arrival timing, the system uses spatial-division multiplexing of the FPGA device memory to achieve real-time performance and high device utilization. Specifically, the FPGA device memory controller of the system transparently preloads and caches the CNN models into the FPGA device memory before the data arrival. For workloads with unpredictable data arrival timing, the system transfers CNN models to the FPGA device memory upon data arrival using time-division multiplexing of FPGA device memory. In the latter case of unpredictable workloads, the switch cost between CNN models is non-negligible to achieve real-time performance and high device utilization, so the system integrates a new scheduling algorithm that considers the switch time of the CNN models. For both predictable and unpredictable workloads, user fairness is achieved by using an ageing technique in the scheduling algorithm that increases the priority of jobs in accordance with the job waiting time. The evaluation results show that the scheduling overhead of the proposed system is negligible for both predictable and unpredictable workloads providing practical real-time performance. For unpredictable workloads, the new scheduling algorithm improves fairness by 24%–94% and resource efficiency by 31%–33% compared to traditional algorithms using first-come first-served or round-robin. For predictable workloads, the system improves fairness by 50.5 % compared to first-come first-served and achieves 99.5 % resource efficiency.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102922"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000254/pdfft?md5=ffbf4f3879d04bf0d1e7a6b33de6606f&pid=1-s2.0-S0167819122000254-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90824046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102920
Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, Ole Schütt, Alfio Lazzaro, Hans Pabst, Stephan Mohr, Jürg Hutter, Thomas D. Kühne, Christian Plessl
We push the boundaries of electronic structure-based ab-initio molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.
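The submatrix idea at the heart of NOLSM can be caricatured in a few lines: the sparsity pattern of each column selects a small dense principal submatrix, the matrix function is applied densely there (the accelerator-friendly step), and the relevant column is kept. This is a rough conceptual sketch under an assumed nonzero diagonal, not the NOLSM implementation.

```python
# Rough conceptual sketch of the submatrix method (simplified, not NOLSM):
# dense work on small submatrices replaces sparse work on the whole matrix.
import numpy as np

def submatrix_apply(S, func):
    """S: symmetric matrix (dense ndarray for clarity); func: matrix function.
    Assumes a nonzero diagonal so column i appears in its own pattern."""
    n = S.shape[0]
    result = np.zeros_like(S)
    for i in range(n):
        idx = np.nonzero(S[:, i])[0]   # sparsity pattern of column i
        sub = S[np.ix_(idx, idx)]      # small *dense* principal submatrix
        fsub = func(sub)               # dense kernel: ideal for GPUs
        col = int(np.where(idx == i)[0][0])
        result[idx, i] = fsub[:, col]  # keep only column i's entries
    return result

# Example: columnwise approximation of the inverse of a banded matrix.
S = np.eye(6) + 0.1 * (np.eye(6, k=1) + np.eye(6, k=-1))
print(np.round(submatrix_apply(S, np.linalg.inv), 3))
```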
{"title":"Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms","authors":"Robert Schade , Tobias Kenter , Hossam Elgabarty , Michael Lass , Ole Schütt , Alfio Lazzaro , Hans Pabst , Stephan Mohr , Jürg Hutter , Thomas D. Kühne , Christian Plessl","doi":"10.1016/j.parco.2022.102920","DOIUrl":"10.1016/j.parco.2022.102920","url":null,"abstract":"<div><p>We push the boundaries of electronic structure-based <em>ab-initio</em> molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102920"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167819122000242/pdfft?md5=cb708fe8c83694714bb33b45ee473a37&pid=1-s2.0-S0167819122000242-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77029045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102893
Jianguo Liang, Rong Hua, Wenqiang Zhu, Yuxi Ye, You Fu, Hao Zhang
The Silicon-Crystal application, based on molecular dynamics (MD), is used to simulate the thermal conductivity of a crystal; it adopts the Tersoff potential to simulate the trajectories of the silicon crystal. Starting from the OpenACC version, task pipeline optimization and an interval-graph-coloring scheduling method are proposed to better address discrete memory access and write dependencies. In addition, part of the code running on the CPEs is vectorized with SIMD instructions to further improve computational performance. After the collaborative OpenACC+Athread development, performance is improved by 16.68 times and achieves a 2.34X speedup compared with the OpenACC version. Moreover, the application scales to 66,560 cores and can simulate 268,435,456 silicon atoms.
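The coloring-based scheduling can be illustrated with a generic greedy conflict-graph coloring (a stand-in for the paper's interval-graph coloring; the data structures are my assumptions): tasks that write to the same location receive different colors, and each color class can then update in parallel without write conflicts.

```python
# Generic greedy conflict-graph coloring sketch (illustrative, not the
# paper's interval-coloring code).
def greedy_coloring(conflicts, n_tasks):
    """conflicts: dict mapping a task to the set of tasks it conflicts with."""
    color = {}
    for t in range(n_tasks):
        used = {color[u] for u in conflicts.get(t, ()) if u in color}
        color[t] = next(c for c in range(n_tasks) if c not in used)
    return color

# Tasks 0 and 1 update the same atom, so they land in different color classes.
print(greedy_coloring({0: {1}, 1: {0}}, n_tasks=3))  # {0: 0, 1: 1, 2: 0}
```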
{"title":"OpenACC + Athread collaborative optimization of Silicon-Crystal application on Sunway TaihuLight","authors":"Jianguo Liang , Rong Hua , Wenqiang Zhu , Yuxi Ye , You Fu , Hao Zhang","doi":"10.1016/j.parco.2022.102893","DOIUrl":"10.1016/j.parco.2022.102893","url":null,"abstract":"<div><p><span>The Silicon-Crystal application based on molecular dynamics (MD) is used to simulate the thermal conductivity of the crystal, which adopts the Tersoff potential to simulate the trajectory of the silicon crystal. Based on the </span>OpenACC<span><span> version, to better solve the problem of discrete memory access and write dependency, task pipeline optimization and the interval graph coloring scheduling method are proposed. Also, the part of codes on CPEs is vectorized by the SIMD command to further improve the computational performance. After the collaborative development of OpenACC+Athread, the performance has been improved by 16.68 times and achieves 2.34X speedup compared with the OpenACC version. Moreover, the application is expanded to 66,560 cores and can simulate reactions of 268,435,456 </span>silicon atoms.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102893"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75647516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance
Pub Date: 2022-07-01 | DOI: 10.1016/j.parco.2022.102918
Yassine Ramdane, Omar Boussaid, Doulkifli Boukraà, Nadia Kabachi, Fadila Bentayeb
Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load-balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O for an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system's query optimizer can perform the star join locally, in a single Spark stage without a shuffle phase. The system can also skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct experiments on a cluster of 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.
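The co-location idea behind the shuffle-free star join can be shown with a toy hash partitioner (plain Python for illustration, not the authors' Hadoop/Spark implementation; all names are mine): when the fact table and a dimension are hash-partitioned on the same key, every join match is already node-local.

```python
# Toy sketch of key-based co-partitioning for a local star join.
def partition(rows, key, n_parts):
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

fact = [{"cust_id": i % 4, "amount": i} for i in range(12)]
dim = [{"cust_id": i, "name": f"c{i}"} for i in range(4)]

fact_parts = partition(fact, "cust_id", n_parts=3)
dim_parts = partition(dim, "cust_id", n_parts=3)

# Each partition joins locally; matching cust_ids are guaranteed co-located.
local_join = [(f["amount"], d["name"])
              for fp, dp in zip(fact_parts, dim_parts)
              for f in fp for d in dp if f["cust_id"] == d["cust_id"]]
print(len(local_join))  # 12: every fact row found its dimension row locally
```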
{"title":"Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance","authors":"Yassine Ramdane , Omar Boussaid , Doulkifli Boukraà , Nadia Kabachi , Fadila Bentayeb","doi":"10.1016/j.parco.2022.102918","DOIUrl":"10.1016/j.parco.2022.102918","url":null,"abstract":"<div><p><span>Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of </span>Hadoop<span> is a challenging task. An OLAP Cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well-known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations indeed increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain questionable, such as decreasing the Spark stages and the network I/O for an OLAP query being executed on a distributed system. In a precedent work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query-optimizer can perform a star join process locally, in only one spark stage without a shuffle phase. Also, the system can skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct some experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102918"},"PeriodicalIF":1.4,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90453784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}