Scaling and Optimizing the Gysela Code on a Cluster of Many-Core Processors
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645933
G. Latu, Y. Asahi, Julien Bigot, Tamas B. Fehér, V. Grandgirard
The current generation of the Xeon Phi Knights Landing (KNL) processor provides a highly multi-threaded environment on which regular programming models such as MPI/OpenMP can be used. Many factors affect the performance that applications achieve on these devices: one key point is the efficient exploitation of the SIMD vector units; another is the memory access pattern. Work has been conducted to adapt a plasma turbulence application, namely Gysela, to this architecture. Several techniques have been used: standard vectorization techniques, auto-tuning of one computation kernel, and switching to a high-order scheme. As a result, KNL execution times have been reduced by up to a factor of 3. This effort has also yielded a speedup of 2x on the Broadwell architecture and 3x on Skylake. Good scalability up to a few thousand cores has been obtained in a strong-scaling experiment. Incremental work brought a large payoff without resorting to low-level intrinsics.
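As an illustration of the auto-tuning step mentioned above, the sketch below times one placeholder kernel over several candidate block sizes and keeps the fastest. It is a minimal example of the idea only; the kernel, block sizes, and data size are assumptions, not the Gysela kernel or the authors' tuning framework.

```python
# Minimal auto-tuning sketch (illustrative only, not the Gysela kernel):
# time a blocked update for several candidate block sizes and keep the
# fastest, the same idea the paper applies to one computation kernel.
import time
import numpy as np

def blocked_update(a, block):
    out = np.empty_like(a)
    n = a.shape[0]
    for start in range(0, n, block):
        end = min(start + block, n)
        # simple vectorizable work per block (placeholder for the real kernel)
        out[start:end] = np.cos(a[start:end]) * 1.5 + a[start:end]
    return out

def autotune(a, candidates=(256, 1024, 4096, 16384)):
    timings = {}
    for block in candidates:
        t0 = time.perf_counter()
        blocked_update(a, block)
        timings[block] = time.perf_counter() - t0
    return min(timings, key=timings.get), timings

if __name__ == "__main__":
    data = np.random.rand(1 << 22)
    best, timings = autotune(data)
    print("best block size:", best, timings)
```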
{"title":"Scaling and Optimizing the Gysela Code on a Cluster of Many-Core Processors","authors":"G. Latu, Y. Asahi, Julien Bigot, Tamas B. Fehér, V. Grandgirard","doi":"10.1109/CAHPC.2018.8645933","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645933","url":null,"abstract":"The current generation of the Xeon Phi Knights Landing (KNL) processor provides a highly multi-threaded environment on which regular programming models such as MPIjopenMP can be used. Many factors impact the performance achieved by applications on these devices: one of the key points is the efficient exploitation of SIMD vector units, and one another is the memory access pattern. Works have been conducted to adapt a plasma turbulence application, namely Gysela, for this architecture. A set of different techniques have been used: standard vectorization techniques, auto-tuning of one computation kernel, switching to high-order scheme. As a result, KNL execution times have been reduced by up to a factor 3. This effort has also permitted to gain a speedup of 2x on Broadwell architecture and 3x on Skylake. Nice scalability curves up to a few thousands cores have been obtained on a strong scaling experiment. Incremental work meant a large payoff without resorting to using low-level intrinsics.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121646048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Green Scientific Data Compression Through High-Level I/O Interfaces
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645921
Yevhen Alforov, T. Ludwig, Anastasiia Novikova, Michael Kuhn, J. Kunkel
Every HPC system today has to cope with a deluge of data generated by scientific applications, simulations, or large-scale experiments. The upscaling of supercomputer systems and infrastructures generally results in a dramatic increase in their energy consumption. In this paper, we argue that techniques like data compression can lead to significant gains in power efficiency by reducing both network and storage requirements. However, any data reduction is highly data specific and should comply with established requirements; an unsuitable or inappropriate compression strategy can therefore consume more resources and energy than necessary. To that end, we propose a novel methodology for on-the-fly, intelligent determination of energy-efficient data reduction for a given data set by leveraging state-of-the-art compression algorithms and metadata at the application-level I/O layer. We motivate our work by analyzing the energy and storage saving needs of data sets from real-life scientific HPC applications, and review the various lossless compression techniques that can be applied. We find that the resulting data reduction can decrease the data volume transferred and stored by as much as 80% in some cases, leading to significant savings in storage and networking costs.
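A minimal sketch of the selection idea described above: measure compression ratio and throughput for several standard-library lossless codecs on a sample of the data and pick one by a simple score. The scoring rule, sample data, and codec list are illustrative assumptions; the paper's actual metadata-driven decision logic and I/O-interface integration are not reproduced here.

```python
# Hedged sketch: pick a lossless codec per data set by measuring ratio and
# throughput, instead of applying one codec blindly. Standard-library codecs
# only; energy accounting and the high-level I/O integration are omitted.
import bz2, lzma, time, zlib

CODECS = {
    "zlib": lambda b: zlib.compress(b, 6),
    "bz2":  lambda b: bz2.compress(b, 9),
    "lzma": lambda b: lzma.compress(b),
}

def choose_codec(data: bytes):
    results = {}
    for name, compress in CODECS.items():
        t0 = time.perf_counter()
        compressed = compress(data)
        dt = time.perf_counter() - t0
        results[name] = {
            "ratio": len(data) / len(compressed),
            "MB_per_s": len(data) / dt / 1e6,
        }
    # toy scoring: favor ratio, penalize slow codecs (a real policy would also
    # weigh network/storage energy costs, as the paper proposes)
    best = max(results, key=lambda n: results[n]["ratio"] * results[n]["MB_per_s"])
    return best, results

if __name__ == "__main__":
    sample = b"temperature,pressure,velocity\n" + b"273.15,101.3,0.0\n" * 50000
    print(choose_codec(sample))
```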
{"title":"Towards Green Scientific Data Compression Through High-Level I/O Interfaces","authors":"Yevhen Alforov, T. Ludwig, Anastasiia Novikova, Michael Kuhn, J. Kunkel","doi":"10.1109/CAHPC.2018.8645921","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645921","url":null,"abstract":"Every HPC system today has to cope with a deluge of data generated by scientific applications, simulations or large-scale experiments. The upscaling of supercomputer systems and infrastructures, generally results in a dramatic increase of their energy consumption. In this paper, we argue that techniques like data compression can lead to significant gains in terms of power efficiency by reducing both network and storage requirements. However, any data reduction is highly data specific and should comply with established requirements. Therefore, unsuitable or inappropriate compression strategy can utilize more resources and energy than necessary. To that end, we propose a novel methodology for achieving on-the-fly intelligent determination of energy efficient data reduction for a given data set by leveraging state-of-the-art compression algorithms and meta data at application-level I/O. We motivate our work by analyzing the energy and storage saving needs of data sets from real-life scientific HPC applications, and review the various lossless compression techniques that can be applied. We find that the resulting data reduction can decrease the data volume transferred and stored by as much as 80 % in some cases, consequently leading to significant savings in storage and networking costs.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121858239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Scale Language Modeling: Converging on 40GB of Text in Four Hours
Pub Date: 2018-08-03 | DOI: 10.1109/CAHPC.2018.8645935
Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro
Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets [1] and then transfer the knowledge gained from these models to a variety of tasks [2]. Following [3], in this work we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) on natural language tasks. By using mixed precision arithmetic and a 32k batch size distributed across 128 NVIDIA Tesla V100 GPUs, we are able to train a character-level 4096-dimension multiplicative LSTM (mLSTM) [4] for unsupervised text reconstruction over 3 epochs of the 40 GB Amazon Reviews dataset [5] in four hours. This runtime compares favorably with previous work that took one month to train the same size and configuration for one epoch over the same dataset [3]. Getting large-batch RNN models to converge can be challenging. Recent work has suggested scaling the learning rate as a function of batch size, but we find that simply doing so leads either to significantly worse convergence or to immediate divergence for this problem. We provide a learning rate schedule that allows our model to converge with a 32k batch size. Since our model converges over the Amazon Reviews dataset in hours, and our compute requirement of 128 Tesla V100 GPUs, while substantial, is commercially available, this work opens up large-scale unsupervised NLP training to most commercial applications and deep learning researchers. Our code is publicly available at https://github.com/NVIDIA/sentiment-discovery; a model can be trained over most public or private text datasets overnight.
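The sketch below shows the general shape of a warmup-plus-decay learning rate schedule of the kind that large-batch training typically needs to avoid early divergence; the warmup length, base rate, and decay form are illustrative assumptions, not the exact schedule used in the paper.

```python
# Illustrative large-batch learning-rate schedule: linear warmup followed by
# linear decay. Hyperparameters here are placeholders, not the paper's values.
def lr_schedule(step, total_steps, base_lr=5e-4, warmup_steps=1000):
    if step < warmup_steps:
        # ramp up from ~0 to base_lr so a very large batch does not diverge early
        return base_lr * (step + 1) / warmup_steps
    # then decay linearly to 0 over the remaining steps
    remaining = max(total_steps - warmup_steps, 1)
    return base_lr * max(0.0, 1.0 - (step - warmup_steps) / remaining)

if __name__ == "__main__":
    for s in (0, 500, 1000, 5000, 10000):
        print(s, round(lr_schedule(s, total_steps=10000), 6))
```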
{"title":"Large Scale Language Modeling: Converging on 40GB of Text in Four Hours","authors":"Raul Puri, Robert Kirby, Nikolai Yakovenko, Bryan Catanzaro","doi":"10.1109/CAHPC.2018.8645935","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645935","url":null,"abstract":"Recent work has shown how to train Convolutional Neural Networks (CNNs) rapidly on large image datasets [1], then transfer the knowledge gained from these models to a variety of tasks [2]. Following [3], in this work, we demonstrate similar scalability and transfer for Recurrent Neural Networks (RNNs) for Natural Language tasks. By utilizing mixed precision arithmetic and a 32k batch size distributed across 128 NVIDIA Tesla V100 GPUs, we are able to train a character-level 4096-dimension multiplicative LSTM (mLSTM) [4] for unsupervised text reconstruction over 3 epochs of the 40 GB Amazon Reviews dataset [5] in four hours. This runtime compares favorably with previous work taking one month to train the same size and configuration for one epoch over the same dataset [3]. Converging large batch RNN models can be challenging. Recent work has suggested scaling the learning rate as a function of batch size, but we find that simply scaling the learning rate as a function of batch size leads either to significantly worse convergence or immediate divergence for this problem. We provide a learning rate schedule that allows our model to converge with a 32k batch size. Since our model converges over the Amazon Reviews dataset in hours, and our compute requirement of 128 Tesla V100 GPUs, while substantial, is commercially available, this work opens up large scale unsupervised NLP training to most commercial applications and deep learning researchers 11Our code is publicly available: https://github.com/NVIDIA/sentiment-discovery, A model can be trained over most public or private text datasets overnight.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128772278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T-SNE-CUDA: GPU-Accelerated T-SNE and its Applications to Modern Data
Pub Date: 2018-07-31 | DOI: 10.1109/CAHPC.2018.8645912
David Chan, Roshan Rao, Forrest Huang, J. Canny
Modern datasets and models are notoriously difficult to explore and analyze due to their inherent high dimensionality and massive numbers of samples. Existing visualization methods that employ dimensionality reduction to two or three dimensions are often inefficient and/or ineffective for these datasets. This paper introduces T-SNE-CUDA, a GPU-accelerated implementation of t-distributed Stochastic Neighbor Embedding (t-SNE) for visualizing datasets and models. T-SNE-CUDA significantly outperforms current implementations, with 50-700x speedups on the CIFAR-10 and MNIST datasets. These speedups enable, for the first time, visualization of neural network activations on the entire ImageNet dataset, a feat that was previously computationally intractable. We also demonstrate visualization performance in the NLP domain by visualizing GloVe embedding vectors. From these visualizations, we can draw interesting conclusions about using the L2 metric in these embedding spaces. T-SNE-CUDA is publicly available at https://github.com/CannyLab/tsne-cuda.
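A usage sketch with a CPU t-SNE baseline on a small scikit-learn dataset; the commented-out import marks where the GPU implementation from the linked repository could be swapped in, assuming it exposes a scikit-learn-style TSNE interface (that interface is an assumption here, not verified against the package).

```python
# CPU t-SNE baseline on a small dataset; swapping in the GPU package from the
# linked repository is shown only as a commented assumption.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
# from tsnecuda import TSNE  # assumed drop-in replacement on a CUDA machine

X = load_digits().data.astype(np.float32)  # 1797 x 64, a stand-in for MNIST/CIFAR features
embedding = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(embedding.shape)  # (1797, 2) points ready for plotting
```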
{"title":"T-SNE-CUDA: GPU-Accelerated T-SNE and its Applications to Modern Data","authors":"David Chan, Roshan Rao, Forrest Huang, J. Canny","doi":"10.1109/CAHPC.2018.8645912","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645912","url":null,"abstract":"Modern datasets and models are notoriously difficult to explore and analyze due to their inherent high dimensionality and massive numbers of samples. Existing visualization methods which employ dimensionality reduction to two or three dimensions are often inefficient and/or ineffective for these datasets. This paper introduces T-SNE-CUDA, a GPU-accelerated implementation of t-distributed Symmetric Neighbour Embedding (t-SNE) for visualizing datasets and models. T-SNE-CUDA significantly outperforms current implementations with 50-700x speedups on the CIFAR-10 and MNIST datasets. These speedups enable, for the first time, visualization of the neural network activations on the entire ImageNet dataset - a feat that was previously computationally intractable. We also demonstrate visualization performance in the NLP domain by visualizing the GloVe embedding vectors. From these visualizations, we can draw interesting conclusions about using the L2 metric in these embedding spaces. T-SNE-CUDA is publicly available at https://github.com/CannyLab/tsne-cuda.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123842474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Argument in Favor of Strong Scaling for Deep Neural Networks with Small Datasets
Pub Date: 2018-07-24 | DOI: 10.1109/SBAC-PAD.2018.00057
R. L. F. Cunha, E. Rodrigues, Matheus Palhares Viana, Dario Augusto Borges Oliveira
In recent years, with the popularization of deep learning frameworks and large datasets, researchers have started parallelizing their models in order to train faster. This is crucially important, because they typically explore many hyperparameters in order to find the best ones for their applications. This process is time consuming and, consequently, speeding up training improves productivity. One approach to parallelizing deep learning models, followed by many researchers, is based on weak scaling: the minibatches increase in size as new GPUs are added to the system. In addition, new learning rate schedules have been proposed to fix optimization issues that occur with large minibatch sizes. In this paper, however, we show that the recommendations provided by recent work do not apply to models that lack large datasets. In fact, we argue in favor of using strong scaling to achieve reliable performance in such cases. We evaluated our approach with up to 32 GPUs and show that weak scaling not only fails to match the accuracy of the sequential model, it also fails to converge most of the time. Meanwhile, strong scaling exhibits good scalability while achieving exactly the same accuracy as a sequential implementation.
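The batch-size bookkeeping behind the two regimes can be summarized in a few lines: weak scaling grows the global minibatch with the number of GPUs, while strong scaling keeps the global minibatch fixed and splits it across devices. The numbers below are illustrative, not the paper's experimental configuration.

```python
# Illustrative contrast between weak and strong scaling of the minibatch.
def weak_scaling(per_gpu_batch, n_gpus):
    # global batch grows with the GPU count
    return {"per_gpu": per_gpu_batch, "global": per_gpu_batch * n_gpus}

def strong_scaling(global_batch, n_gpus):
    # global batch stays fixed; each GPU gets a smaller share
    assert global_batch % n_gpus == 0, "global batch must split evenly"
    return {"per_gpu": global_batch // n_gpus, "global": global_batch}

if __name__ == "__main__":
    for g in (1, 8, 32):
        print(g, "GPUs  weak:", weak_scaling(64, g), " strong:", strong_scaling(256, g))
```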
{"title":"An Argument in Favor of Strong Scaling for Deep Neural Networks with Small Datasets","authors":"R. L. F. Cunha, E. Rodrigues, Matheus Palhares Viana, Dario Augusto Borges Oliveira","doi":"10.1109/SBAC-PAD.2018.00057","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2018.00057","url":null,"abstract":"In recent years, with the popularization of deep learning frameworks and large datasets, researchers have started parallelizing their models in order to train faster. This is crucially important, because they typically explore many hyperparameters in order to find the best ones for their applications. This process is time consuming and, consequently, speeding up training improves productivity. One approach to parallelize deep learning models followed by many researchers is based on weak scaling. The minibatches increase in size as new GPUs are added to the system. In addition, new learning rates schedules have been proposed to fix optimization issues that occur with large minibatch sizes. In this paper, however, we show that the recommendations provided by recent work do not apply to models that lack large datasets. In fact, we argument in favor of using strong scaling for achieving reliable performance in such cases. We evaluated our approach with up to 32 GPUs and show that weak scaling not only does not have the same accuracy as the sequential model, it also fails to converge most of time. Meanwhile, strong scaling has good scalability while having exactly the same accuracy of a sequential implementation.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116956979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Pub Date: 2018-06-14 | DOI: 10.1109/CAHPC.2018.8645906
Behzad Salami, O. Unsal, A. Cristal
Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data, which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology nodes scaling below 10 nm, hardware accelerators become more susceptible to faults, which in turn can impact NN accuracy. In this paper, we study the resilience aspects of a Register-Transfer Level (RTL) model of NN accelerators, in particular fault characterization and mitigation. Following a High-Level Synthesis (HLS) approach, we first characterize the vulnerability of the various components of the RTL NN. We observe that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate values) and NN layers, and ii) architectural-level specifications, i.e., the data representation model and the parallelism degree of the underlying accelerator. Second, motivated by the characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, performing 47.3% better than state-of-the-art methods.
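A high-level software sketch of bit-flip fault injection into weights, in the spirit of the characterization described above: it flips random bits in a float32 weight tensor with NumPy and reports how far a single matrix product drifts from the fault-free result. It does not model the authors' RTL accelerator, HLS flow, or data representation choices; the fault counts and tensor sizes are arbitrary.

```python
# Software sketch of bit-flip fault injection into NN weights (NumPy only).
import numpy as np

def inject_bit_flips(weights: np.ndarray, n_faults: int, seed: int = 0):
    """Flip n_faults random bits in a float32 weight tensor and return a faulty copy."""
    rng = np.random.default_rng(seed)
    flat = weights.astype(np.float32).view(np.uint32).ravel()
    idx = rng.integers(0, flat.size, size=n_faults)
    bits = rng.integers(0, 32, size=n_faults).astype(np.uint32)
    flat[idx] ^= np.uint32(1) << bits          # flip one random bit per selected word
    return flat.view(np.float32).reshape(weights.shape)

if __name__ == "__main__":
    w = np.random.randn(128, 128).astype(np.float32)
    x = np.random.randn(1, 128).astype(np.float32)
    clean = x @ w
    for n in (1, 16, 256):
        faulty = x @ inject_bit_flips(w, n)
        print(n, "faults -> max abs output deviation:", float(np.max(np.abs(faulty - clean))))
```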
{"title":"On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation","authors":"Behzad Salami, O. Unsal, A. Cristal","doi":"10.1109/CAHPC.2018.8645906","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645906","url":null,"abstract":"Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact the NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) model of NN accelerators, in particular, fault characterization and mitigation. By following a High-Level Synthesis (HLS) approach, first, we characterize the vulnerability of various components of RTL NN. We observed that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate) and NN layers and ii) architectural-level specifications, i.e., data representation model and the parallelism degree of the underlying accelerator. Second, motivated by characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, by 47.3% better than state-of-the-art methods.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121664327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}