A modern memory system is equipped with many memory channels to obtain high memory bandwidth. To take advantage of this organization, application data are distributed among the channels and transferred in an interleaved fashion. Although memory-intensive applications benefit from the high bandwidth offered by many memory channels, others, such as compute-intensive applications, do not need it. To reduce energy consumption for such applications, the memory system provides low-power modes that the main memory can enter while no memory requests arrive. However, these applications often issue intermittent memory requests to the channels that hold their data, preventing those channels from entering the low-power modes. Hence, the memory system cannot enter the low-power modes even though the applications do not need the high bandwidth. To solve this problem, this paper proposes a dynamic data allocation mechanism for many-channel memory systems. The mechanism confines the data of such applications to specified channels by dynamically changing the address-mapping scheme and migrating the data. As a result, the remaining channels, to which no data are allocated, have a chance to stay in the low-power modes for long periods. Therefore, the proposed mechanism has the potential to reduce the energy consumption of many-channel memory systems. The evaluation results show that the mechanism can reduce energy consumption by up to 11.8%, and by 1.7% on average.
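As a rough illustration of the channel interleaving the abstract refers to (the paper's actual address-mapping schemes are not given here, so the bit positions, line size and channel count below are assumptions):

    # Hypothetical channel-interleaved address mapping (not the paper's
    # actual scheme): the channel index is taken from the low-order address
    # bits just above the cache-line offset, so consecutive lines rotate
    # across channels. Confining an application to fewer channels means
    # remapping those bits and migrating its data.

    CACHE_LINE_BITS = 6   # 64-byte cache lines (assumed)
    NUM_CHANNELS = 8      # assumed channel count

    def channel_interleaved(addr: int, num_channels: int = NUM_CHANNELS) -> int:
        """Channel selected from the bits right above the line offset."""
        return (addr >> CACHE_LINE_BITS) % num_channels

    def channel_restricted(addr: int, active_channels: int) -> int:
        """Same mapping, but data confined to a subset of channels so the
        remaining ones can stay in low-power modes."""
        return (addr >> CACHE_LINE_BITS) % active_channels

    if __name__ == "__main__":
        for addr in range(0, 1024, 64):
            assert channel_restricted(addr, 2) in (0, 1)  # only 2 channels busy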
{"title":"An Energy-aware Dynamic Data Allocation Mechanism for Many-channel Memory Systems","authors":"Masayuki Sato, Takuya Toyoshima, Hikaru Takayashiki, Ryusuke Egawa, Hiroaki Kobayashi","doi":"10.14529/jsfi190401","DOIUrl":"https://doi.org/10.14529/jsfi190401","url":null,"abstract":"A modern memory system is equipped with many memory channels to obtain a high memory bandwidth. To take the advantage of this organization, applications’ data are distributed among the channels and transferred in an interleaved fashion. Although memory-intensive applications benefit from a high bandwidth by many memory channels, applications such as compute-intensive ones do not need the high bandwidth. To reduce the energy consumption for such applications, the memory system has low-power modes. During no memory request, the main memory can enter these modes and reduce energy consumption. However, these applications often cause intermittent memory requests to the channels that handle their data, resulting in not entering the low-power modes. Hence, the memory system cannot enter the low-power modes even though the applications do not need the high bandwidth. To solve this problem, this paper proposes a dynamic data allocation mechanism for many-channel memory systems. This mechanism forces data of such applications to use the specified channels by dynamically changing the address-mapping schemes and migrating the data. As a result, the other channels to which the data are not allocated can have a chance to enter the low-power modes for a long time. Therefore, the proposed mechanism has the potential to reduce the energy consumption of many-channel memory systems. The evaluation results show that this mechanism can reduce the energy consumption by up to 11.8% and 1.7% on average.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121368709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural networks are becoming more and more popular both in science and in industry, mostly because solutions based on neural networks show state-of-the-art results in domains previously dominated by traditional methods, e.g., computer vision and speech recognition. But to achieve these results, neural networks have become progressively more complex and thus require much more training; training a neural network today can take weeks. This problem can be solved by parallelizing neural network training on modern clusters and supercomputers, which can significantly reduce the training time. Faster training is essential for data scientists, because it delivers results sooner and allows the next decision to be made earlier. In this paper we provide an overview of the distributed learning capabilities of popular modern deep learning frameworks, both in terms of provided functionality and of performance. We consider multiple hardware choices: training on multiple GPUs and on multiple computing nodes.
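As a concrete example of the distributed-learning functionality such frameworks provide, here is a minimal multi-GPU data-parallel training sketch in PyTorch; the survey's actual framework list, models and settings are not given in the abstract, so everything below is illustrative:

    # Minimal sketch of distributed data-parallel training with PyTorch.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("nccl")        # one process per GPU
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        model = torch.nn.Linear(512, 10).cuda(rank)
        model = DDP(model, device_ids=[rank])  # gradients all-reduced automatically
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(100):
            x = torch.randn(32, 512, device=rank)
            y = torch.randint(0, 10, (32,), device=rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    if __name__ == "__main__":
        main()  # launch with: torchrun --nproc_per_node=<num_gpus> script.py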
{"title":"Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms","authors":"Denis Shaikhislamov, A. Sozykin, V. Voevodin","doi":"10.14529/jsfi190404","DOIUrl":"https://doi.org/10.14529/jsfi190404","url":null,"abstract":"Neural networks are becoming more and more popular in scientific field and in the industry. It is mostly because new solutions using neural networks show state-of-the-art results in the domains previously occupied by traditional methods, eg. computer vision, speech recognition etc. But to get these results neural networks become progressively more complex, thus needing a lot more training. The training of neural networks today can take weeks. This problems can be solved by parallelization of the neural networks training and using modern clusters and supercomputers, which can significantly reduce the learning time. Today, a faster training for data scientist is essential, because it allows to get the results faster to make the next decision. In this paper we provide an overview of distributed learning provided by the popular modern deep learning frameworks, both in terms of provided functionality and performance. We consider multiple hardware choices: training on multiple GPUs and multiple computing nodes.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114226343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
New applications that can exploit emerging exascale computing resources efficiently, while providing meaningful scientific results, are eagerly anticipated. Multi-scale models, and especially multi-scale applications, will assuredly run at the exascale. We have established that a class of multi-scale applications implementing the heterogeneous multi-scale model follows a heterogeneous multi-scale computing (HMC) pattern, which typically features a macroscopic model synchronising numerous independent microscopic model simulations. Consequently, communication between microscopic simulations is limited. Furthermore, a surrogate model can often be introduced between the macro-scale and micro-scale models to interpolate required data from previously computed micro-scale simulations, thereby substantially reducing the number of micro-scale simulations. Nonetheless, HMC applications, though versatile, remain constrained by load-balancing issues. We discuss two main issues: the a priori unknown and variable execution time of microscopic simulations, and the dynamically varying number of micro-scale simulations required. We tackle execution-time variability using a pilot job mechanism to handle internal queuing and multiple sub-model execution on large-scale supercomputers, together with a data-informed execution-time prediction model. To dynamically select the number of micro-scale simulations, the HMC pattern automatically detects and identifies three surrogate-model phases that help control the number of cores available and in use. After phase detection and micro-scale simulation scheduling, any idle cores can be used to update the surrogate model or be released back to the system. We demonstrate HMC performance by testing it on two representative multi-scale applications. We conclude that, considering the subtle interplay between the macro-scale model, surrogate models and micro-scale simulations, HMC provides a promising path towards exascale for many multi-scale applications.
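A minimal sketch of the macro-micro loop with a surrogate model, as described above; the class and method names (required_micro_inputs, predict, update, advance) are hypothetical stand-ins, not the authors' API:

    # Schematic of the HMC macro-micro loop with a surrogate model
    # (hypothetical structure inferred from the abstract, not the authors' code).
    def hmc_step(macro_state, surrogate, launch_micro, tol=1e-3):
        """Advance the macroscopic model one step, using the surrogate
        where it can interpolate confidently, micro-simulations otherwise."""
        needed = macro_state.required_micro_inputs()
        results = {}
        for point in needed:
            estimate, error = surrogate.predict(point)
            if error < tol:            # surrogate trusted: no micro run needed
                results[point] = estimate
            else:                      # fall back to a micro-scale simulation,
                results[point] = launch_micro(point)  # e.g. via the pilot-job mechanism
                surrogate.update(point, results[point])
        return macro_state.advance(results)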
{"title":"Towards Heterogeneous Multi-scale Computing on Large Scale Parallel Supercomputers","authors":"S. Alowayyed, M. Vassaux, B. Czaja, P. Coveney, A. Hoekstra","doi":"10.14529/jsfi190402","DOIUrl":"https://doi.org/10.14529/jsfi190402","url":null,"abstract":"New applications that can exploit emerging exascale computing resources efficiently, while providing meaningful scientific results, are eagerly anticipated. Multi-scale models, especially multi-scale applications, will assuredly run at the exascale. We have established that a class of multi-scale applications implementing the heterogeneous multi-scale model follows, a heterogeneous multi-scale computing (HMC) pattern, which typically features a macroscopic model synchronising numerous independent microscopic model simulations. Consequently, communication between microscopic simulations is limited. Furthermore, a surrogate model can often be introduced between macro-scale and micro-scale models to interpolate required data from previously computed micro-scale simulations, thereby substantially reducing the number of micro-scale simulations. Nonetheless, HMC applications, though versatile, remain constrained by load balancing issues. We discuss two main issues: the a priori unknown and variable execution time of microscopic simulations, and the dynamic number of micro-scale simulations required. We tackle execution time variability using a pilot job mechanism to handle internal queuing and multiple sub-model execution on large-scale supercomputers, together with a data-informed execution time prediction model. To dynamically select the number of micro-scale simulations, the HMC pattern automatically detects and identifies three surrogate model phases that help control the available and used core amount. After relevant phase detection and micro-scale simulation scheduling, any idle cores can be used for surrogate model update or for processor release back to the system. We demonstrate HMC performance by testing it on two representative multi-scale applications. We conclude that, considering the subtle interplay between the macroscale model, surrogate models and micro-scale simulations, HMC provides a promising path towards exascale for many multiscale applications.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126183862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix-multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for the GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked-matrix GEMM in order to (a) achieve load balance, (b) maximize weight-matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are still hot in cache. Additionally, we bring the time-step loop inside our cell to further increase the weight reuse and to amortize the overhead of transforming the weights into the blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations; e.g., for the RNN, the forward pass is up to ~3× faster, whereas the backward/weight-update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they outperform vectorized, reduced-accuracy, vendor-optimized (Intel SVML) libraries by 1.6–2.6×, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture.
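The fusion idea can be illustrated with a short NumPy sketch: apply the element-wise activation to each output block right after its partial GEMM is computed, while the block is still hot in cache (the block sizes here are placeholders, not the paper's tuned values):

    # Illustrative sketch of GEMM/activation fusion: the element-wise op runs
    # on each output block immediately after that block's partial GEMM,
    # instead of as a separate bandwidth-bound pass over the whole result.
    import numpy as np

    def blocked_gemm_fused(W, x, act=np.tanh, bm=64, bn=64):
        M, K = W.shape
        _, N = x.shape
        y = np.empty((M, N))
        for i in range(0, M, bm):          # block rows of the weight matrix
            for j in range(0, N, bn):      # block columns of the minibatch
                blk = W[i:i+bm, :] @ x[:, j:j+bn]   # partial GEMM block
                y[i:i+bm, j:j+bn] = act(blk)        # fused element-wise op
        return y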
{"title":"Optimizing Deep Learning RNN Topologies on Intel Architecture","authors":"K. Banerjee, E. Georganas, Dhiraj D. Kalamkar, Barukh Ziv, Eden Segal, Cristina S. Anderson, A. Heinecke","doi":"10.14529/jsfi190304","DOIUrl":"https://doi.org/10.14529/jsfi190304","url":null,"abstract":"Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of vanilla RNN cell and its two popular variants: LSTM and GRU for Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) onto the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse relies on how GEMMs are parallelized and is sub-optimal for GEMM sizes stemming from small minibatch. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, (c) fuse the element-wise operations after partial GEMM blocks are computed and while they are hot in cache. Additionally, we bring the time step loop in our cell to further increase the weight reuse and amortize the overhead to transform the weights into blocked layout. The results show that our implementation is generally faster than Intel MKL-DNN library implementations, e.g. for RNN, forward pass is up to ~3× faster whereas the backward/weight update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, these are able to outperform vectorized and reduced accuracy vendor-optimized (Intel SVML) libraries by 1.6–2.6× while speep up over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel’s latest CascadeLake architecture.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133477604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In order to benefit from emerging high-performance computing systems, weather and climate models need to be adapted to run efficiently on different hardware architectures, such as accelerators. This is a major challenge for existing community models, which consist of extremely large codebases written in Fortran. Large parts of the code can be ported using OpenACC compiler directives, but for time-critical components such as physical parameterizations, code restructuring and hardware-specific optimizations are necessary to obtain high performance. In an effort to retain a single source code for multiple target architectures, the CLAW Compiler and the CLAW Single Column Abstraction (SCA) were introduced. We report on the extension of the CLAW SCA to handle ELEMENTAL functions and subroutines. We demonstrate the new capability on the JSBACH land surface scheme of the ICON climate model. With the extension, JSBACH can be automatically ported to OpenACC or OpenMP for accelerators with minimal to no change to the original code.
{"title":"Automatic Port to OpenACC/OpenMP for Physical Parameterization in Climate and Weather Code Using the CLAW Compiler","authors":"Valentin Clement, P. Marti, X. Lapillonne, O. Fuhrer, W. Sawyer","doi":"10.14529/jsfi190303","DOIUrl":"https://doi.org/10.14529/jsfi190303","url":null,"abstract":"In order to benefit from emerging high-performance computing systems, weather and climate models need to be adapted to run efficiently on different hardware architectures such as accelerators. This is a major challenge for existing community models that represent extremely large codebase written in Fortran. Large parts of the code can be ported using OpenACC compiler directives but for time-critical components such as physical parameterizations, code restructuring and optimizations specific to a hardware architecture are necessary to obtain high performance. In an effort to retain a single source code for multiple target architectures, the CLAW Compiler and the CLAW Single Column Abstraction were introduced. We report on the extension of the CLAW SCA to handle ELEMENTAL functions and subroutines. We demonstrate the new capability on the JSBACH land surface scheme of the ICON climate model. With the extension, JSBACH can be automatically ported to OpenACC or OpenMP for accelerators with minimal to no change to the original code.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127624332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As the number of cores and the memory bandwidth have increased in a balanced fashion, modern vector processors achieve high sustained performance, especially in memory-intensive applications in the fields of science and engineering. However, it is difficult to significantly increase the off-chip memory bandwidth owing to the limited number of input/output pins that can be integrated on a single chip. Under these circumstances, modern vector processors have adopted a shared cache to realize a high sustained memory bandwidth. The shared cache can effectively reduce the pressure on the off-chip memory bandwidth by keeping the reusable data that multiple vector cores require. However, as the number of vector cores sharing the cache increases, more distinct blocks requested simultaneously by multiple cores compete for the same set. As a result, conflict misses caused by these blocks degrade the performance. To avoid an increase in conflict misses as the number of cores grows, this paper proposes a skewed cache for many-core vector processors. The skewed cache prevents simultaneously requested blocks from being stored in the same set. This paper discusses how the two most important features of the skewed cache should be implemented in modern vector processors: the hashing function and the replacement policy. The proposed cache adopts odd-multiplier displacement hashing for effective skewing and the static re-reference interval prediction (SRRIP) policy for sensible replacement. The evaluation results show that the proposed cache significantly improves the performance of a many-core vector processor by eliminating conflict misses.
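A heavily simplified sketch of per-way skewed indexing with odd multipliers follows; the actual hashing function, multiplier choices and cache geometry in the paper may differ:

    # Sketch of per-way skewed indexing with odd-multiplier hashing
    # (illustrative assumptions, not the paper's parameters).
    NUM_SETS = 1024                  # sets per way (assumed)
    ODD_MULTIPLIERS = [1, 3, 5, 7]   # one distinct odd multiplier per way

    def skewed_index(block_addr: int, way: int) -> int:
        """Each way hashes the block address with its own odd multiplier,
        so blocks that collide in one way usually map to different sets
        in the other ways, reducing conflict misses."""
        return (block_addr * ODD_MULTIPLIERS[way]) % NUM_SETS

    # In a conventional set-associative cache every way uses the same index,
    # block_addr % NUM_SETS, so colliding blocks collide in all ways at once.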
{"title":"A Skewed Multi-banked Cache for Many-core Vector Processors","authors":"Hikaru Takayashiki, Masayuki Sato, K. Komatsu, Hiroaki Kobayashi","doi":"10.14529/jsfi190305","DOIUrl":"https://doi.org/10.14529/jsfi190305","url":null,"abstract":"As the number of cores and the memory bandwidth have increased in a balanced fashion, modern vector processors achieve high sustained performances, especially in memory-intensive applications in the fields of science and engineering. However, it is difficult to significantly increase the off-chip memory bandwidth owing to the limitation of the number of input/output pins integrated on a single chip. Under the circumstances, modern vector processors have adopted a shared cache to realize a high sustained memory bandwidth. The shared cache can effectively reduce the pressure to the off-chip memory bandwidth by keeping reusable data that multiple vector cores require. However, as the number of vector cores sharing a cache increases, more different blocks requested from multiple cores simultaneously use the same set. As a result, conflict misses caused by these blocks degrade the performance. In order to avoid increasing the conflict misses in the case of the increasing number of cores, this paper proposes a skewed cache for many-core vector processors. The skewed cache prevents the simultaneously requested blocks from being stored into the same set. This paper discusses how the most important two features of the skewed cache should be implemented in modern vector processors: hashing function and replacement policy. The proposed cache adopts the oddmultiplier displacement hashing for effective skewing and the static re-reference interval prediction policy for reasonable replacing. The evaluation results show that the proposed cache significantly improves the performance of a many-core vector processor by eliminating conflict misses.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117069923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The huge number of hardware and software components, together with the large number of parameters affecting the performance of each parallel application, makes ensuring the efficiency of a large-scale supercomputer extremely difficult. In this situation, all basic parameters of the supercomputer should be constantly monitored, and many decisions about its functioning should be made automatically by special software. In this paper we describe the tight connection between the complexity of modern large high-performance computing systems and the special techniques and tools required to ensure their efficiency in practice. The main subsystems of the developed complex (Octoshell, DiMMoN, Octotron, JobDigest, and an expert software system that brings fine-grained analytics on parallel applications and the entire supercomputer to users and sysadmins) are actively operated on the large supercomputer systems at Lomonosov Moscow State University. A brief description of the architecture of the Lomonosov-2 supercomputer is presented, and questions are discussed that show both the wide variety of emerging complex issues and the need for an integrated approach to effectively supporting large supercomputer systems.
{"title":"Supercomputer Lomonosov-2: Large Scale, Deep Monitoring and Fine Analytics for the User Community","authors":"V. Voevodin, A. Antonov, D. Nikitenko, P. Shvets, S. Sobolev, Igor Yu. Sidorov, K. Stefanov, V. Voevodin, S. Zhumatiy","doi":"10.14529/JSFI190201","DOIUrl":"https://doi.org/10.14529/JSFI190201","url":null,"abstract":"The huge number of hardware and software components, together with a large number of parameters affecting the performance of each parallel application, makes ensuring the efficiency of a large scale supercomputer extremely difficult. In this situation, all basic parameters of the supercomputer should be constantly monitored, as well as many decisions about its functioning should be made by special software automatically. In this paper we describe the tight connection between complexity of modern large high performance computing systems and special techniques and tools required to ensure their efficiency in practice. The main subsystems of the developed complex (Octoshell, DiMMoN, Octotron, JobDigest, and an expert software system to bring fine analytics on parallel applications and the entire supercomputer to users and sysadmins) are actively operated on the large supercomputer systems at Lomonosov Moscow State University. A brief description of the architecture of Lomonosov-2 supercomputer is presented, and questions showing both a wide variety of emerging complex issues and the need for an integrated approach to solving the problem of effectively supporting large supercomputer systems are discussed.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123243970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Benchmarks in high-performance computing often involve a single component used in the full solution of a computational problem, such as the solution of a linear system of equations. In many cases, the choice of algorithm, which can determine the components used, is also important when solving a full problem. Numerical evidence suggests that for the Taylor-Green vortex problem at a Reynolds number of 1600, a second-order implicit midpoint rule method can require less computational time than the often-used linearly implicit Carpenter-Kennedy method for solving the equations of incompressible fluid dynamics at moderate levels of accuracy at the beginning of the flow evolution. The primary reason is that, even though the implicit midpoint rule is fully implicit, it can use a small number of iterations per time step and thus require less computational work per time step than the Carpenter-Kennedy method. For the same number of time steps, the Carpenter-Kennedy method is more accurate, since it uses a higher-order time-stepping method.
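A generic sketch of the implicit midpoint rule solved by fixed-point iteration (an ODE toy problem, not the paper's incompressible-flow solver):

    # Implicit midpoint rule for y' = f(y): solve
    #   y_new = y + h * f((y + y_new) / 2)
    # with a few fixed-point iterations. The paper's observation is that a
    # small, fixed number of iterations per step can suffice.
    import numpy as np

    def implicit_midpoint_step(f, y, h, iters=3):
        y_new = y + h * f(y)                 # explicit Euler predictor
        for _ in range(iters):               # fixed-point corrector sweeps
            y_new = y + h * f(0.5 * (y + y_new))
        return y_new

    # Example: decay equation y' = -y, exact solution exp(-t)
    f = lambda y: -y
    y, h = np.array([1.0]), 0.1
    for _ in range(10):
        y = implicit_midpoint_step(f, y, h)
    print(y, np.exp(-1.0))   # close agreement at t = 1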
{"title":"Fully Implicit Time Stepping Can Be Efficient on Parallel Computers","authors":"B. Cloutier, B. Muite, M. Parsani","doi":"10.14529/JSFI190206","DOIUrl":"https://doi.org/10.14529/JSFI190206","url":null,"abstract":"Benchmarks in high performance computing often involve a single component used in the full solution of a computational problem, such as the solution of a linear system of equations. In many cases, the choice of algorithm, which can determine the components used, is also important when solving a full problem. Numerical evidence suggests that for the Taylor-Green vortex problem at a Reynolds number of 1600, a second order implicit midpoint rule method can require less computational time than the often used linearly implicit Carpenter-Kennedy method for solving the equations of incompressible fluid dynamics for moderate levels of accuracy at the beginning of the flow evolution. The primary reason is that even though the implicit midpoint rule is fully implicit, it can use a small number of iterations per time step, and thus require less computational work per time step than the Carpenter-Kennedy method. For the same number of timesteps, the Carpenter-Kennedy method is more accurate since it uses a higher order timestepping method.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124903805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the performance limits of different algorithmic approaches to the implementation of a sample problem: the solution of the wave equation with a cross-stencil scheme. With this, we aim to find the upper limit of the achievable performance efficiency for stencil computing. To estimate the limits, we use a quantitative Roofline model to make a thorough analysis of the performance bottlenecks, and we develop the model further to account for the latency of the different levels of GPU memory. These estimates provide an incentive to use spatial and temporal blocking algorithms. Thus, we study stepwise, domain decomposition, and domain decomposition with halo algorithms, in that order. Knowing the limit motivates optimizing the implementation; this led us to analyze the block synchronization methods in CUDA, which is also covered in the text. After all optimizations, we achieved 90% of the peak performance, which amounts to more than 1 trillion cell updates per second on a single consumer-level GPU.
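For reference, the basic Roofline bound that the paper starts from (and then extends with GPU memory latency) can be stated in a few lines; the hardware numbers below are assumptions for illustration only:

    # Classic Roofline bound: attainable performance = min(peak, AI * bandwidth).
    def roofline_gflops(arith_intensity, peak_gflops, mem_bw_gbs):
        return min(peak_gflops, arith_intensity * mem_bw_gbs)

    # A cross-stencil update reads and writes only a few values per cell, so
    # its arithmetic intensity (flop/byte) is low and the naive kernel is
    # bandwidth-bound; temporal blocking raises the effective intensity.
    print(roofline_gflops(0.5, 10000.0, 500.0))   # 250 GFLOP/s: memory-bound
    print(roofline_gflops(20.0, 10000.0, 500.0))  # 10000 GFLOP/s: compute-bound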
{"title":"Performance Limits Study of Stencil Codes on Modern GPGPUs","authors":"Ilya S. Pershin, V. Levchenko, A. Perepelkina","doi":"10.14529/JSFI190207","DOIUrl":"https://doi.org/10.14529/JSFI190207","url":null,"abstract":"We study the performance limits of different algorithmic approaches to the implementation of a sample problem of wave equation solution with a cross stencil scheme. With this, we aim to find the highest limit of the achievable performance efficiency for stencil computing. To estimate the limits, we use a quantitative Roofline model to make a thorough analysis of the performance bottlenecks and develop the model further to account for the latency of different levels of GPU memory. These estimates provide an incentive to use spatial and temporal blocking algorithms. Thus, we study stepwise, domain decomposition, and domain decomposition with halo algorithms in that order. The knowledge of the limit incites the motivation to optimize the implementation. This led to the analysis of the block synchronization methods in CUDA, which is also provided in the text. After all optimizations, we have achieved 90% of the peak performance, which amounts to more than 1 trillion cell updates per second on one consumer level GPU device.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134096410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A universal framework for modeling composites and fabrics of micro- and nanofibers, such as carbon nanotubes, carbon fibers and amyloid fibrils, is presented. Within this framework, fibers are represented as chains of rigid bodies linked by elastic bonds. The elasticity of the bonds utilizes the recently developed enhanced vector model formalism. The type of interaction between fibers is determined by their nature and by the physical length scale of the simulation. The dynamics of the fibers is computed using a modification of the rigid-particle-dynamics module of the waLBerla multiphysics framework. Our modeling system demonstrates exceptionally high parallel performance combined with physical accuracy. The efficiency of our technique is demonstrated with an illustrative mechanical test on a hypothetical carbon nanotube textile. In this example, the elasticity of the fibers represents the coarse-grained covalent bonds within the CNT surface, whereas the inter-fiber interactions represent coarse-grained van der Waals forces between cylindrical segments of nanotubes. The numerical simulation demonstrates the stability and extreme strength of a hypothetical carbon nanotube fabric.
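A generic bead-chain sketch of the fiber representation follows: rigid segments joined by elastic bonds. The enhanced vector model additionally carries orientation, bending and torsion terms that this harmonic-stretching toy omits:

    # Minimal sketch of a fiber as a chain of segments joined by elastic
    # bonds (a generic illustration, not the authors' waLBerla-based code).
    import numpy as np

    def bond_forces(pos, k=1.0, r0=1.0):
        """Harmonic stretching force on each segment from bonds to neighbors.
        pos: (n, 3) array of segment positions along one fiber."""
        f = np.zeros_like(pos)
        d = pos[1:] - pos[:-1]               # bond vectors between neighbors
        r = np.linalg.norm(d, axis=1, keepdims=True)
        fb = k * (r - r0) * d / r            # force restoring bonds toward r0
        f[:-1] += fb                         # on the left segment of each bond
        f[1:] -= fb                          # reaction on the right segment
        return f

    chain = np.cumsum(np.ones((10, 3)) * [1.1, 0, 0], axis=0)  # stretched fiber
    print(bond_forces(chain)[:2])            # end segments are pulled inward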
{"title":"Distinct Element Simulation of Mechanical Properties of Hypothetical CNT Nanofabrics","authors":"I. Ostanin","doi":"10.14529/JSFI190208","DOIUrl":"https://doi.org/10.14529/JSFI190208","url":null,"abstract":"A universal framework for modeling composites and fabrics of micro- and nanofibers, such as carbon nanotubes, carbon fibers and amyloid fibrils, is presented. Within this framework, fibers are represented with chains of rigid bodies, linked with elastic bonds. Elasticity of the bonds utilizes recently developed enhanced vector model formalism. The type of interactions between fibers is determined by their nature and physical length scale of the simulation. The dynamics of fibers is computed using the modification of rigid particle dynamics module of the waLBerla multiphysics framework. Our modeling system demonstrates exceptionally high parallel performance combined with the physical accuracy of the modeling. The efficiency of our technique is demonstrated with an illustrative mechanical test on a hypothetical carbon nanotube textile. In this example, the elasticity of the fibers represents the coarse-grained covalent bond within CNT surface, whereas interfiber interactions represent coarse-grained van der Waals forces between cylindrical segments of nanotubes. Numerical simulation demonstrates stability and extremal strength of a hypothetical carbon nanotube fabric.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116727442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}