Optimizing Deep Learning RNN Topologies on Intel Architecture

K. Banerjee, E. Georganas, Dhiraj D. Kalamkar, Barukh Ziv, Eden Segal, Cristina S. Anderson, A. Heinecke
{"title":"基于Intel架构的深度学习RNN拓扑优化","authors":"K. Banerjee, E. Georganas, Dhiraj D. Kalamkar, Barukh Ziv, Eden Segal, Cristina S. Anderson, A. Heinecke","doi":"10.14529/jsfi190304","DOIUrl":null,"url":null,"abstract":"Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of vanilla RNN cell and its two popular variants: LSTM and GRU for Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) onto the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse relies on how GEMMs are parallelized and is sub-optimal for GEMM sizes stemming from small minibatch. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, (c) fuse the element-wise operations after partial GEMM blocks are computed and while they are hot in cache. Additionally, we bring the time step loop in our cell to further increase the weight reuse and amortize the overhead to transform the weights into blocked layout. The results show that our implementation is generally faster than Intel MKL-DNN library implementations, e.g. for RNN, forward pass is up to ~3× faster whereas the backward/weight update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, these are able to outperform vectorized and reduced accuracy vendor-optimized (Intel SVML) libraries by 1.6–2.6× while speep up over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel’s latest CascadeLake architecture.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Optimizing Deep Learning RNN Topologies on Intel Architecture\",\"authors\":\"K. Banerjee, E. Georganas, Dhiraj D. Kalamkar, Barukh Ziv, Eden Segal, Cristina S. Anderson, A. Heinecke\",\"doi\":\"10.14529/jsfi190304\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of vanilla RNN cell and its two popular variants: LSTM and GRU for Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) onto the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse relies on how GEMMs are parallelized and is sub-optimal for GEMM sizes stemming from small minibatch. 
Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, (c) fuse the element-wise operations after partial GEMM blocks are computed and while they are hot in cache. Additionally, we bring the time step loop in our cell to further increase the weight reuse and amortize the overhead to transform the weights into blocked layout. The results show that our implementation is generally faster than Intel MKL-DNN library implementations, e.g. for RNN, forward pass is up to ~3× faster whereas the backward/weight update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, these are able to outperform vectorized and reduced accuracy vendor-optimized (Intel SVML) libraries by 1.6–2.6× while speep up over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel’s latest CascadeLake architecture.\",\"PeriodicalId\":338883,\"journal\":{\"name\":\"Supercomput. Front. Innov.\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-09-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Supercomput. Front. Innov.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14529/jsfi190304\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Supercomput. Front. Innov.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14529/jsfi190304","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

Recurrent neural network (RNN) models have been found to be well suited for processing temporal data. In this work, we present an optimized implementation of the vanilla RNN cell and its two popular variants, LSTM and GRU, for the Intel Xeon architecture. Typical implementations of these RNN cells employ one or two large matrix multiplication (GEMM) calls and then apply the element-wise operations (sigmoid/tanh) onto the GEMM results. While this approach is easy to implement by exploiting vendor-optimized GEMM library calls, the data reuse depends on how the GEMMs are parallelized and is sub-optimal for GEMM sizes stemming from small minibatches. Also, the element-wise operations are exposed as a bandwidth-bound kernel after the GEMM, which is typically a compute-bound kernel. To address this discrepancy, we implemented a parallel blocked matrix GEMM in order to (a) achieve load balance, (b) maximize weight matrix reuse, and (c) fuse the element-wise operations after partial GEMM blocks are computed, while they are still hot in cache. Additionally, we bring the time step loop into our cell to further increase the weight reuse and to amortize the overhead of transforming the weights into a blocked layout. The results show that our implementation is generally faster than the Intel MKL-DNN library implementations; e.g. for the RNN cell, the forward pass is up to ~3× faster, whereas the backward/weight-update pass is up to ~5× faster. Furthermore, we investigate high-performance implementations of the sigmoid and tanh activation functions that achieve various levels of accuracy. These implementations rely on minimax polynomial approximations, rational polynomials, Taylor expansions and exponential approximation techniques. Our vectorized implementations can be flexibly integrated into deep learning computations with different accuracy requirements without compromising performance; in fact, they are able to outperform the vectorized, reduced-accuracy vendor-optimized (Intel SVML) library by 1.6–2.6×, while the speedup over GNU libm is close to two orders of magnitude. All our experiments are conducted on Intel's latest Cascade Lake architecture.
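For context on the "one or two large GEMM calls plus element-wise operations" structure the abstract refers to, the standard LSTM cell can be written as follows (generic textbook notation, not necessarily the exact formulation used in the paper):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_o) \\
g_t &= \tanh(W_g x_t + R_g h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Stacking the four input weight matrices into W = [W_i; W_f; W_o; W_g] and the four recurrent matrices into R = [R_i; R_f; R_o; R_g] yields at most two large GEMMs per time step (W x_t and R h_{t-1}); everything else is element-wise, which is exactly the bandwidth-bound tail that the fused, blocked formulation targets.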
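To illustrate the fusion idea, the following is a minimal, single-threaded C sketch for the vanilla RNN cell, not the paper's actual implementation (which uses JIT-generated, vectorized small-GEMM kernels, OpenMP parallelization and pre-packed weights): the recurrent GEMM is computed tile by tile, and tanh is applied to each freshly computed tile while it is still hot in cache. All names, layouts and block sizes are illustrative assumptions.

```c
#include <math.h>

/* Sketch of a fused, blocked vanilla-RNN forward pass:
 *   h[t] = tanh(Wx[t] + h[t-1] * R)
 * Wx[t] (input GEMM results plus bias) are assumed to be precomputed
 * for all time steps with one large GEMM. The recurrent GEMM is
 * blocked over (minibatch, hidden) tiles, and the activation is fused
 * into the tile loop instead of running as a separate pass over h.   */
void rnn_forward_fused(int T, int N, int H, int BN, int BH,
                       const float *R,   /* H x H recurrent weights, row-major */
                       const float *Wx,  /* T x N x H, precomputed W*x + b     */
                       const float *h0,  /* N x H initial hidden state         */
                       float *h)         /* T x N x H output hidden states     */
{
  for (int t = 0; t < T; ++t) {                 /* time-step loop inside the cell */
    const float *h_prev = (t == 0) ? h0 : &h[(t - 1) * N * H];
    float *h_cur = &h[t * N * H];
    for (int nb = 0; nb < N; nb += BN) {        /* minibatch blocks */
      for (int hb = 0; hb < H; hb += BH) {      /* hidden-state blocks */
        for (int n = nb; n < nb + BN && n < N; ++n) {
          /* small GEMM tile: accumulate h_prev * R onto the Wx tile */
          for (int j = hb; j < hb + BH && j < H; ++j) {
            float acc = Wx[t * N * H + n * H + j];
            for (int k = 0; k < H; ++k)
              acc += h_prev[n * H + k] * R[k * H + j];
            h_cur[n * H + j] = acc;
          }
          /* fused element-wise activation on the row just computed */
          for (int j = hb; j < hb + BH && j < H; ++j)
            h_cur[n * H + j] = tanhf(h_cur[n * H + j]);
        }
      }
    }
  }
}
```

In a production version along the lines described in the abstract, the inner tile would be a vectorized small-GEMM kernel, the (nb, hb) tiles would be distributed across threads for load balance, and R would be packed into a blocked layout once, outside the time-step loop, so the transformation cost is amortized over all T steps.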
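The abstract also mentions reduced-accuracy, vectorization-friendly tanh/sigmoid implementations based on minimax and rational polynomial approximations. The snippet below is a generic, branch-free rational approximation in that spirit; the coefficients are classic Padé-style ones chosen for brevity, not the paper's, and its accuracy (maximum absolute error roughly 2e-2 around |x| ≈ 1.5) is far looser than what minimax fits can achieve.

```c
#include <math.h>

/* Branch-free rational approximation of tanh(x), illustrative only.
 * The argument is saturated and a low-degree rational polynomial is
 * evaluated, so the routine maps directly onto SIMD min/max/fma/div
 * instructions with no table lookups or branches.                   */
static inline float tanh_rational(float x)
{
  const float c = 3.0f;               /* tanh(3) ~ 0.995, approx hits 1 here */
  x = fmaxf(-c, fminf(c, x));         /* saturate the argument               */
  const float x2 = x * x;
  /* Pade-style approximation: x * (27 + x^2) / (27 + 9*x^2) */
  return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}

/* sigmoid via the exact identity sigmoid(x) = 0.5 * (1 + tanh(x/2)) */
static inline float sigmoid_rational(float x)
{
  return 0.5f * (1.0f + tanh_rational(0.5f * x));
}
```

Because such approximations cost only a handful of multiplies, adds and one divide per element, they keep the fused activation step cheap relative to the GEMM work; tighter accuracy targets simply trade in higher-degree minimax polynomials or exponential-based formulations, as surveyed in the paper.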