
Latest publications in Supercomput. Front. Innov.

An Energy-aware Dynamic Data Allocation Mechanism for Many-channel Memory Systems
Pub Date : 2020-01-28 DOI: 10.14529/jsfi190401
Masayuki Sato, Takuya Toyoshima, Hikaru Takayashiki, Ryusuke Egawa, Hiroaki Kobayashi
A modern memory system is equipped with many memory channels to obtain a high memory bandwidth. To take advantage of this organization, applications' data are distributed among the channels and transferred in an interleaved fashion. Although memory-intensive applications benefit from the high bandwidth provided by many memory channels, compute-intensive applications do not need it. To reduce the energy consumption of such applications, the memory system provides low-power modes: when no memory requests arrive, the main memory can enter these modes and save energy. However, these applications often issue intermittent memory requests to the channels that hold their data, which prevents those channels from entering the low-power modes. Hence, the memory system cannot enter the low-power modes even though the applications do not need the high bandwidth. To solve this problem, this paper proposes a dynamic data allocation mechanism for many-channel memory systems. The mechanism forces the data of such applications onto a specified subset of channels by dynamically changing the address-mapping scheme and migrating the data. As a result, the remaining channels, to which no data are allocated, get a chance to stay in the low-power modes for long periods. Therefore, the proposed mechanism has the potential to reduce the energy consumption of many-channel memory systems. The evaluation results show that the mechanism reduces energy consumption by up to 11.8%, and by 1.7% on average.
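To make the allocation idea concrete, the sketch below contrasts a conventional cache-line-interleaved address-to-channel mapping with a restricted mapping that confines an application's traffic to a small subset of channels, leaving the remaining channels idle and free to enter a low-power mode. The line size, channel count, and mapping functions are illustrative assumptions, not the paper's actual scheme.

```python
# Sketch of the allocation idea: a conventional mapping interleaves
# consecutive cache lines over every channel, while a restricted mapping
# keeps one application's traffic on a few channels so the others stay idle.
# Line size, channel count and both mappings are assumed for illustration.

LINE_SIZE = 64        # bytes per cache line (assumed)
NUM_CHANNELS = 8      # channels in the hypothetical memory system

def channel_interleaved(addr):
    """Conventional mapping: consecutive lines rotate over all channels."""
    return (addr // LINE_SIZE) % NUM_CHANNELS

def channel_restricted(addr, active_channels=2):
    """Restricted mapping: the same address stream touches only a few
    channels; the remaining channels see no requests and may power down."""
    return (addr // LINE_SIZE) % active_channels

if __name__ == "__main__":
    addrs = [i * LINE_SIZE for i in range(16)]      # 16 consecutive lines
    touched_all = {channel_interleaved(a) for a in addrs}
    touched_few = {channel_restricted(a) for a in addrs}
    print("channels touched (interleaved):", sorted(touched_all))
    print("channels touched (restricted) :", sorted(touched_few))
    print("channels free to power down   :", NUM_CHANNELS - len(touched_few))
```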
{"title":"An Energy-aware Dynamic Data Allocation Mechanism for Many-channel Memory Systems","authors":"Masayuki Sato, Takuya Toyoshima, Hikaru Takayashiki, Ryusuke Egawa, Hiroaki Kobayashi","doi":"10.14529/jsfi190401","DOIUrl":"https://doi.org/10.14529/jsfi190401","url":null,"abstract":"A modern memory system is equipped with many memory channels to obtain a high memory bandwidth. To take the advantage of this organization, applications’ data are distributed among the channels and transferred in an interleaved fashion. Although memory-intensive applications benefit from a high bandwidth by many memory channels, applications such as compute-intensive ones do not need the high bandwidth. To reduce the energy consumption for such applications, the memory system has low-power modes. During no memory request, the main memory can enter these modes and reduce energy consumption. However, these applications often cause intermittent memory requests to the channels that handle their data, resulting in not entering the low-power modes. Hence, the memory system cannot enter the low-power modes even though the applications do not need the high bandwidth. To solve this problem, this paper proposes a dynamic data allocation mechanism for many-channel memory systems. This mechanism forces data of such applications to use the specified channels by dynamically changing the address-mapping schemes and migrating the data. As a result, the other channels to which the data are not allocated can have a chance to enter the low-power modes for a long time. Therefore, the proposed mechanism has the potential to reduce the energy consumption of many-channel memory systems. The evaluation results show that this mechanism can reduce the energy consumption by up to 11.8% and 1.7% on average.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121368709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms
Pub Date : 2019-12-01 DOI: 10.14529/jsfi190404
Denis Shaikhislamov, A. Sozykin, V. Voevodin
Neural networks are becoming more and more popular both in science and in industry, mostly because solutions based on neural networks show state-of-the-art results in domains previously dominated by traditional methods, e.g., computer vision and speech recognition. To obtain these results, however, neural networks have become progressively more complex and thus require far more training; training a neural network today can take weeks. This problem can be addressed by parallelizing the training and using modern clusters and supercomputers, which can significantly reduce the learning time. Faster training is essential for data scientists, because it allows them to obtain results sooner and make the next decision. In this paper we provide an overview of the distributed learning facilities offered by popular modern deep learning frameworks, both in terms of functionality and performance. We consider multiple hardware configurations: training on multiple GPUs and on multiple computing nodes.
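The common scheme behind the distributed learning facilities surveyed here is synchronous data-parallel training: every worker computes gradients on its own data shard, the gradients are averaged (an allreduce), and all replicas apply the same update. Below is a minimal, framework-free sketch of that scheme; the model, shard sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal, framework-free sketch of synchronous data-parallel training:
# each worker computes gradients on its own shard, the gradients are
# averaged (the allreduce step), and every replica applies the same update.
# Model, shard sizes and hyperparameters are illustrative assumptions.

import numpy as np

def gradient(w, x, y):
    """Gradient of the least-squares loss 0.5*||x @ w - y||^2 on one shard."""
    return x.T @ (x @ w - y) / len(y)

def train_data_parallel(shards, dim, lr=0.1, steps=50):
    w = np.zeros(dim)                                  # replicas start identical
    for _ in range(steps):
        grads = [gradient(w, x, y) for x, y in shards] # "workers" in parallel
        avg_grad = np.mean(grads, axis=0)              # allreduce (average)
        w -= lr * avg_grad                             # identical update everywhere
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0, 0.5])
    shards = []
    for _ in range(4):                                 # four simulated workers
        x = rng.normal(size=(64, 3))
        shards.append((x, x @ true_w))
    print(train_data_parallel(shards, dim=3))          # approaches true_w
```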
{"title":"Survey on Software Tools that Implement Deep Learning Algorithms on Intel/x86 and IBM/Power8/Power9 Platforms","authors":"Denis Shaikhislamov, A. Sozykin, V. Voevodin","doi":"10.14529/jsfi190404","DOIUrl":"https://doi.org/10.14529/jsfi190404","url":null,"abstract":"Neural networks are becoming more and more popular in scientific field and in the industry. It is mostly because new solutions using neural networks show state-of-the-art results in the domains previously occupied by traditional methods, eg. computer vision, speech recognition etc. But to get these results neural networks become progressively more complex, thus needing a lot more training. The training of neural networks today can take weeks. This problems can be solved by parallelization of the neural networks training and using modern clusters and supercomputers, which can significantly reduce the learning time. Today, a faster training for data scientist is essential, because it allows to get the results faster to make the next decision. In this paper we provide an overview of distributed learning provided by the popular modern deep learning frameworks, both in terms of provided functionality and performance. We consider multiple hardware choices: training on multiple GPUs and multiple computing nodes.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114226343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Towards Heterogeneous Multi-scale Computing on Large Scale Parallel Supercomputers
Pub Date : 2019-12-01 DOI: 10.14529/jsfi190402
S. Alowayyed, M. Vassaux, B. Czaja, P. Coveney, A. Hoekstra
New applications that can exploit emerging exascale computing resources efficiently, while providing meaningful scientific results, are eagerly anticipated. Multi-scale models, and multi-scale applications in particular, will assuredly run at the exascale. We have established that a class of multi-scale applications implementing the heterogeneous multi-scale model follows a heterogeneous multi-scale computing (HMC) pattern, which typically features a macroscopic model synchronising numerous independent microscopic model simulations. Consequently, communication between microscopic simulations is limited. Furthermore, a surrogate model can often be introduced between the macro-scale and micro-scale models to interpolate required data from previously computed micro-scale simulations, thereby substantially reducing the number of micro-scale simulations. Nonetheless, HMC applications, though versatile, remain constrained by load balancing issues. We discuss two main issues: the a priori unknown and variable execution time of microscopic simulations, and the dynamically changing number of micro-scale simulations required. We tackle execution time variability using a pilot job mechanism that handles internal queuing and multiple sub-model execution on large-scale supercomputers, together with a data-informed execution time prediction model. To dynamically select the number of micro-scale simulations, the HMC pattern automatically detects and identifies three surrogate model phases that help control the number of available and used cores. After phase detection and micro-scale simulation scheduling, any idle cores can be used for surrogate model updates or released back to the system. We demonstrate HMC performance by testing it on two representative multi-scale applications. We conclude that, considering the subtle interplay between the macro-scale model, surrogate models and micro-scale simulations, HMC provides a promising path towards the exascale for many multi-scale applications.
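The HMC control flow sketched in this abstract (a macro-scale loop that consults a surrogate and only launches a micro-scale simulation when no previously computed result is close enough) can be illustrated as follows. The toy micro-scale model and the trust radius are assumptions made for illustration, not the authors' implementation.

```python
# Sketch of HMC-style control flow: the macro-scale loop asks a surrogate
# for the micro-scale response and launches a real micro-scale simulation
# only when no previously computed sample is close enough. The toy micro
# model and the trust radius are assumptions, not the authors' code.

import math

def micro_simulation(strain):
    """Stand-in for an expensive micro-scale run (e.g., molecular dynamics)."""
    return math.tanh(2.0 * strain)          # toy constitutive response

class NearestSampleSurrogate:
    def __init__(self, trust_radius=0.05):
        self.samples = []                   # (input, output) pairs computed so far
        self.trust_radius = trust_radius

    def query(self, x):
        nearest = min(self.samples, key=lambda s: abs(s[0] - x), default=None)
        if nearest is not None and abs(nearest[0] - x) <= self.trust_radius:
            return nearest[1]               # reuse a previous micro-scale result
        y = micro_simulation(x)             # otherwise pay for a new micro run
        self.samples.append((x, y))
        return y

if __name__ == "__main__":
    surrogate = NearestSampleSurrogate()
    macro_states = [0.01 * step for step in range(100)]   # macro time loop
    responses = [surrogate.query(s) for s in macro_states]
    print("macro steps:", len(macro_states),
          "| micro-scale runs actually launched:", len(surrogate.samples))
```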
Citations: 3
Optimizing Deep Learning RNN Topologies on Intel Architecture
Pub Date : 2019-09-30 DOI: 10.14529/jsfi190304
K. Banerjee, E. Georganas, Dhiraj D. Kalamkar, Barukh Ziv, Eden Segal, Cristina S. Anderson, A. Heinecke
Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) to the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse relies on how the GEMMs are parallelized and is sub-optimal for GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are hot in cache. Additionally, we bring the time step loop into our cell to further increase weight reuse and amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations; e.g., for the RNN, the forward pass is up to ~3× faster whereas the backward/weight-update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they are able to outperform the vectorized, reduced-accuracy, vendor-optimized (Intel SVML) libraries by 1.6–2.6×, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture.
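The fusion idea (apply the element-wise activation to each partial GEMM block while it is still hot in cache, rather than in a separate bandwidth-bound pass over the full GEMM result) can be sketched functionally in a few lines. This numpy sketch only illustrates the restructuring; the shapes, block size, and activation are assumptions, and the paper's kernels are hand-optimized for Xeon rather than written in Python.

```python
# Sketch of fusing the element-wise activation into a blocked GEMM: the
# activation is applied to each block of the output right after that block
# is computed, instead of in a second sweep over the whole GEMM result.
# Shapes, block size and the activation are assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gemm_then_activation(a, b):
    """Unfused reference: full GEMM, then a separate pass for the activation."""
    return sigmoid(a @ b)

def fused_blocked_gemm_activation(a, b, block=64):
    """Fused variant: activation applied while each output block is still hot."""
    out = np.empty((a.shape[0], b.shape[1]))
    for i in range(0, a.shape[0], block):
        partial = a[i:i + block] @ b          # partial GEMM block
        out[i:i + block] = sigmoid(partial)   # element-wise op fused in
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(256, 128))
    b = rng.normal(size=(128, 512))
    assert np.allclose(gemm_then_activation(a, b),
                       fused_blocked_gemm_activation(a, b))
    print("fused and unfused results match")
```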
Citations: 9
Automatic Port to OpenACC/OpenMP for Physical Parameterization in Climate and Weather Code Using the CLAW Compiler
Pub Date : 2019-09-16 DOI: 10.14529/jsfi190303
Valentin Clement, P. Marti, X. Lapillonne, O. Fuhrer, W. Sawyer
In order to benefit from emerging high-performance computing systems, weather and climate models need to be adapted to run efficiently on different hardware architectures such as accelerators. This is a major challenge for existing community models, which represent extremely large codebases written in Fortran. Large parts of the code can be ported using OpenACC compiler directives, but for time-critical components such as physical parameterizations, code restructuring and optimizations specific to a hardware architecture are necessary to obtain high performance. In an effort to retain a single source code for multiple target architectures, the CLAW Compiler and the CLAW Single Column Abstraction (SCA) were introduced. We report on the extension of the CLAW SCA to handle ELEMENTAL functions and subroutines. We demonstrate the new capability on the JSBACH land surface scheme of the ICON climate model. With the extension, JSBACH can be automatically ported to OpenACC or OpenMP for accelerators with minimal to no change to the original code.
Citations: 5
A Skewed Multi-banked Cache for Many-core Vector Processors
Pub Date : 2019-09-16 DOI: 10.14529/jsfi190305
Hikaru Takayashiki, Masayuki Sato, K. Komatsu, Hiroaki Kobayashi
As the number of cores and the memory bandwidth have increased in a balanced fashion, modern vector processors achieve high sustained performance, especially in memory-intensive applications in the fields of science and engineering. However, it is difficult to significantly increase the off-chip memory bandwidth owing to the limited number of input/output pins that can be integrated on a single chip. Under these circumstances, modern vector processors have adopted a shared cache to realize a high sustained memory bandwidth. The shared cache can effectively reduce the pressure on the off-chip memory bandwidth by keeping reusable data that multiple vector cores require. However, as the number of vector cores sharing the cache increases, more distinct blocks requested by multiple cores simultaneously map to the same set. As a result, the conflict misses caused by these blocks degrade performance. In order to avoid an increase in conflict misses as the number of cores grows, this paper proposes a skewed cache for many-core vector processors. The skewed cache prevents simultaneously requested blocks from being stored in the same set. This paper discusses how the two most important features of a skewed cache should be implemented in modern vector processors: the hashing function and the replacement policy. The proposed cache adopts odd-multiplier displacement hashing for effective skewing and the static re-reference interval prediction policy for reasonable replacement. The evaluation results show that the proposed cache significantly improves the performance of a many-core vector processor by eliminating conflict misses.
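To see why skewing helps: in a conventional set-indexed cache, blocks whose addresses differ by a multiple of the number of sets all collide in a single set, whereas a per-bank hash spreads them across different sets in different banks. The sketch below uses a simple hash in the spirit of odd-multiplier displacement hashing; the exact function and parameters in the paper may differ, so treat everything here as an assumption.

```python
# Sketch of skewed indexing: each bank indexes the cache with its own hash of
# the block address, so blocks that all collide under conventional modulo
# indexing are spread over different sets. The hash below (mixing the upper
# address bits via a per-bank odd multiplier) is an illustrative assumption,
# not the exact function evaluated in the paper.

NUM_SETS = 64                    # sets per bank (assumed)
NUM_BANKS = 4
ODD_MULTIPLIERS = [1, 3, 5, 7]   # one distinct odd multiplier per bank

def conventional_index(block_addr):
    return block_addr % NUM_SETS

def skewed_index(block_addr, bank):
    displaced = (block_addr * ODD_MULTIPLIERS[bank]) // NUM_SETS
    return (displaced + block_addr) % NUM_SETS

if __name__ == "__main__":
    # Blocks whose addresses differ by a multiple of NUM_SETS all fight over
    # one set in a conventionally indexed cache.
    conflicting = [base * NUM_SETS for base in range(8)]
    print("conventional sets:", sorted({conventional_index(b) for b in conflicting}))
    for bank in range(NUM_BANKS):
        sets = sorted({skewed_index(b, bank) for b in conflicting})
        print(f"bank {bank} skewed sets:", sets)
```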
Citations: 0
Supercomputer Lomonosov-2: Large Scale, Deep Monitoring and Fine Analytics for the User Community
Pub Date : 2019-06-26 DOI: 10.14529/JSFI190201
V. Voevodin, A. Antonov, D. Nikitenko, P. Shvets, S. Sobolev, Igor Yu. Sidorov, K. Stefanov, V. Voevodin, S. Zhumatiy
The huge number of hardware and software components, together with the large number of parameters affecting the performance of each parallel application, makes ensuring the efficiency of a large-scale supercomputer extremely difficult. In this situation, all basic parameters of the supercomputer should be constantly monitored, and many decisions about its functioning should be made automatically by special software. In this paper we describe the tight connection between the complexity of modern large high-performance computing systems and the special techniques and tools required to ensure their efficiency in practice. The main subsystems of the developed complex (Octoshell, DiMMoN, Octotron, JobDigest, and an expert software system that brings fine-grained analytics on parallel applications and the entire supercomputer to users and sysadmins) are actively operated on the large supercomputer systems at Lomonosov Moscow State University. A brief description of the architecture of the Lomonosov-2 supercomputer is presented, and questions are discussed that show both the wide variety of emerging complex issues and the need for an integrated approach to effectively supporting large supercomputer systems.
Citations: 200
Fully Implicit Time Stepping Can Be Efficient on Parallel Computers
Pub Date : 2019-06-25 DOI: 10.14529/JSFI190206
B. Cloutier, B. Muite, M. Parsani
Benchmarks in high performance computing often involve a single component used in the full solution of a computational problem, such as the solution of a linear system of equations. In many cases, the choice of algorithm, which can determine the components used, is also important when solving a full problem. Numerical evidence suggests that for the Taylor-Green vortex problem at a Reynolds number of 1600, a second order implicit midpoint rule method can require less computational time than the often used linearly implicit Carpenter-Kennedy method for solving the equations of incompressible fluid dynamics for moderate levels of accuracy at the beginning of the flow evolution. The primary reason is that even though the implicit midpoint rule is fully implicit, it can use a small number of iterations per time step, and thus require less computational work per time step than the Carpenter-Kennedy method. For the same number of timesteps, the Carpenter-Kennedy method is more accurate since it uses a higher order timestepping method.
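The cost argument rests on the implicit midpoint rule, y_{n+1} = y_n + h f(t_n + h/2, (y_n + y_{n+1})/2), needing only a few fixed-point iterations per time step when the step size is moderate. A minimal sketch on a toy ODE is given below; the test problem, tolerance, and step size are illustrative choices, not the paper's Taylor-Green setup.

```python
# Minimal sketch of the implicit midpoint rule solved by fixed-point
# iteration: y_next = y + h * f(t + h/2, (y + y_next)/2).
# The toy problem y' = -y and all tolerances are illustrative choices.

import math

def implicit_midpoint_step(f, y, t, h, tol=1e-12, max_iter=50):
    """Advance one step; return the new value and the iterations used."""
    y_next = y + h * f(t, y)                      # explicit Euler predictor
    for iteration in range(1, max_iter + 1):
        y_new = y + h * f(t + 0.5 * h, 0.5 * (y + y_next))
        if abs(y_new - y_next) < tol:             # fixed point reached
            return y_new, iteration
        y_next = y_new
    return y_next, max_iter

if __name__ == "__main__":
    f = lambda t, y: -y                           # exact solution: exp(-t)
    y, t, h, steps = 1.0, 0.0, 0.1, 100
    total_iterations = 0
    for _ in range(steps):                        # integrate up to t = 10
        y, used = implicit_midpoint_step(f, y, t, h)
        total_iterations += used
        t += h
    print(f"y(10) = {y:.3e}, exact = {math.exp(-10.0):.3e}, "
          f"average iterations per step = {total_iterations / steps:.1f}")
```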
Citations: 2
Performance Limits Study of Stencil Codes on Modern GPGPUs
Pub Date : 2019-06-24 DOI: 10.14529/JSFI190207
Ilya S. Pershin, V. Levchenko, A. Perepelkina
We study the performance limits of different algorithmic approaches to the implementation of a sample problem: the solution of the wave equation with a cross stencil scheme. With this, we aim to find the highest achievable performance efficiency for stencil computing. To estimate the limits, we use a quantitative Roofline model to make a thorough analysis of the performance bottlenecks, and we develop the model further to account for the latency of the different levels of GPU memory. These estimates provide an incentive to use spatial and temporal blocking algorithms; thus, we study stepwise, domain decomposition, and domain decomposition with halo algorithms, in that order. Knowledge of the limit motivates optimizing the implementation, which led us to an analysis of the block synchronization methods in CUDA that is also provided in the text. After all optimizations, we have achieved 90% of the peak performance, which amounts to more than 1 trillion cell updates per second on one consumer-level GPU device.
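The Roofline reasoning used here can be reproduced as a back-of-the-envelope calculation: attainable performance is the minimum of the peak arithmetic rate and the memory bandwidth multiplied by the arithmetic intensity. The hardware figures and per-cell flop/byte counts below are assumptions chosen for illustration, not measurements from the paper.

```python
# Back-of-the-envelope Roofline model: attainable performance is
# min(peak FLOP rate, DRAM bandwidth * arithmetic intensity).
# All numbers below are assumptions for illustration only.

PEAK_FLOPS = 10e12   # assumed peak arithmetic rate, FLOP/s
PEAK_BW = 600e9      # assumed DRAM bandwidth, bytes/s
FLOPS_PER_CELL = 8   # assumed cost of one cross-stencil wave-equation update

def roofline(arithmetic_intensity):
    """Attainable FLOP/s for a kernel with the given flop/byte ratio."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

if __name__ == "__main__":
    cases = {
        # stepwise update, no reuse: ~24 bytes of DRAM traffic per cell update
        "stepwise (no blocking)": FLOPS_PER_CELL / 24.0,
        # deep temporal blocking: assume only ~0.5 bytes reach DRAM per update
        "temporal blocking": FLOPS_PER_CELL / 0.5,
    }
    for name, ai in cases.items():
        perf = roofline(ai)
        cells = perf / FLOPS_PER_CELL
        print(f"{name:22s}: AI = {ai:5.2f} flop/byte -> "
              f"{perf / 1e12:.2f} TFLOP/s, {cells / 1e9:.0f} Gcell updates/s")
```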
Citations: 6
Distinct Element Simulation of Mechanical Properties of Hypothetical CNT Nanofabrics
Pub Date : 2019-06-24 DOI: 10.14529/JSFI190208
I. Ostanin
A universal framework for modeling composites and fabrics of micro- and nanofibers, such as carbon nanotubes, carbon fibers and amyloid fibrils, is presented. Within this framework, fibers are represented as chains of rigid bodies linked with elastic bonds. The elasticity of the bonds utilizes the recently developed enhanced vector model formalism. The type of interaction between fibers is determined by their nature and by the physical length scale of the simulation. The dynamics of the fibers is computed using a modification of the rigid particle dynamics module of the waLBerla multiphysics framework. Our modeling system demonstrates exceptionally high parallel performance combined with physical accuracy. The efficiency of our technique is demonstrated with an illustrative mechanical test on a hypothetical carbon nanotube textile. In this example, the elasticity of the fibers represents the coarse-grained covalent bonds within the CNT surface, whereas the interfiber interactions represent coarse-grained van der Waals forces between cylindrical segments of the nanotubes. The numerical simulation demonstrates the stability and extremal strength of a hypothetical carbon nanotube fabric.
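As a heavily simplified illustration of the fiber representation, the sketch below treats one fiber as a chain of segments connected by harmonic stretching bonds. Real enhanced-vector-model bonds also carry bending and torsional terms, and all constants here are assumed, so this is only a toy analogue of the framework described above.

```python
# Toy analogue of the fiber representation: one fiber is a chain of segments
# connected by harmonic stretching bonds. Real enhanced-vector-model bonds
# also resist bending and torsion; those terms and all constants here are
# omitted or assumed.

import numpy as np

def bond_forces(positions, rest_length=1.0, stiffness=100.0):
    """Harmonic stretching forces between consecutive segments of one fiber."""
    forces = np.zeros_like(positions)
    for i in range(len(positions) - 1):
        d = positions[i + 1] - positions[i]
        length = np.linalg.norm(d)
        f = stiffness * (length - rest_length) * d / length   # pulls toward rest
        forces[i] += f
        forces[i + 1] -= f
    return forces

if __name__ == "__main__":
    # A straight five-segment fiber stretched 10% beyond its rest length:
    # the end segments feel a net inward pull, interior forces cancel out.
    positions = np.array([[1.1 * i, 0.0, 0.0] for i in range(5)])
    print(bond_forces(positions))
```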
Citations: 0