
2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS): Latest Publications

Vandermonde Wave Function Ansatz for Improved Variational Monte Carlo
Pub Date : 2020-11-01 DOI: 10.1109/DLS51937.2020.00010
Alberto Acevedo, Michael Curry, Shantanu H. Joshi, Brett Leroux, Nicholas Malaya
Solutions to the Schrödinger equation can be used to predict the electronic structure of molecules and materials and therefore infer their complex physical and chemical properties. Variational Quantum Monte Carlo (VMC) is a technique that can be used to solve the weak form of the Schrödinger equation. Applying VMC to systems with N electrons involves evaluating the determinant of an N by N matrix. The evaluation of this determinant scales as $O(N^{3})$ and is the main computational cost in the VMC process. In this work, we investigate an alternative VMC technique based on the Vandermonde determinant. The Vandermonde determinant is a product of pairwise differences and so evaluating it scales as $O(N^{2})$. Therefore, this approach reduces the computational cost by a factor of N. The Vandermonde determinant was implemented in PyTorch and the performance was assessed in approximating the ground state energy of various quantum systems against existing techniques. The performance is evaluated in a variety of systems, starting with the one-dimensional particle in a box, and then considering more complicated atomic systems with multiple particles. The Vandermonde determinant was also implemented in PauliNet, a deep-learning architecture for VMC. The new method is shown to be computationally efficient, and results in a speed-up as large as 5X. In these cases, the new ansatz obtains a reasonable approximation for wavefunctions of atomic systems, but does not reach the accuracy of the Hartree-Fock method that relies on the Slater determinant. It is observed that while the use of neural networks in VMC can result in highly accurate solutions, further work is necessary to determine an appropriate balance between computational time and accuracy.
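To make the scaling argument concrete, the sketch below is a minimal, hypothetical PyTorch implementation of a Vandermonde-type ansatz for 1D coordinates (not the authors' code): the log-amplitude is the sum of logarithms of pairwise coordinate differences, which requires only an N by N table of differences and therefore costs $O(N^{2})$, whereas a general Slater-style determinant costs $O(N^{3})$.

```python
# Minimal sketch of a Vandermonde-type ansatz for 1D coordinates; names and
# shapes are illustrative, not the paper's implementation.
import torch

def vandermonde_log_psi(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, N) particle coordinates.
    Returns log|psi(x)| for psi(x) = prod_{i<j} (x_j - x_i), i.e. the
    Vandermonde determinant, evaluated in O(N^2)."""
    n = x.shape[-1]
    diffs = x.unsqueeze(-1) - x.unsqueeze(-2)        # (batch, N, N) pairwise differences
    iu = torch.triu_indices(n, n, offset=1)          # index pairs with i < j
    pairs = diffs[..., iu[1], iu[0]]                 # x_j - x_i for each pair
    return torch.log(pairs.abs()).sum(dim=-1)        # log of the product, for stability

def slater_log_det(phi: torch.Tensor) -> torch.Tensor:
    """For comparison: log|det| of a full (batch, N, N) Slater-style matrix,
    which costs O(N^3) per sample."""
    return torch.linalg.slogdet(phi)[1]
```

In a full VMC loop this log-amplitude would feed the Metropolis sampler and the local-energy estimator; only the ansatz evaluation changes.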
Citations: 9
Online-Codistillation Meets LARS, Going beyond the Limit of Data Parallelism in Deep Learning
Pub Date : 2020-11-01 DOI: 10.1109/DLS51937.2020.00006
Shogo Murai, Hiroaki Mikami, Masanori Koyama, Shuji Suzuki, Takuya Akiba
Data parallel training is a powerful family of methods for the efficient training of deep neural networks on big data. Unfortunately, however, recent studies have shown that the benefit of increased batch size, in terms of both speed and model performance, diminishes rapidly beyond some point. This seems to apply even to LARS, the state-of-the-art large-batch stochastic optimization method. In this paper, we combine LARS with online-codistillation, a recently developed, efficient deep learning algorithm built on a different philosophy: stabilizing the training procedure using a collaborative ensemble of models. We show that the combination of large-batch training and online-codistillation is much more efficient than either one alone. We also present a novel way of implementing online-codistillation that further speeds up the computation. We demonstrate the efficacy of our approach on various benchmark datasets.
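For readers unfamiliar with the two ingredients, the sketch below illustrates them in PyTorch under stated assumptions (it is not the authors' implementation; the trust coefficient, weight decay, and distillation weight are placeholder values): LARS scales each layer's update by a trust ratio of the form $\|w\| / (\|\nabla L\| + \lambda\|w\|)$, while online codistillation adds a loss term that pulls each model's predictions towards those of a concurrently trained peer.

```python
# Illustrative sketch only: a LARS-style trust-ratio update and an
# online-codistillation loss term. Hyperparameters and the exact loss form
# are assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def lars_scaled_update(param, grad, base_lr, eta=1e-3, weight_decay=1e-4):
    """Layer-wise adaptive rate scaling: scale each parameter tensor's update
    by roughly ||w|| / (||g|| + wd * ||w||)."""
    w_norm = param.norm()
    g = grad + weight_decay * param
    g_norm = g.norm()
    trust = torch.where(
        (w_norm > 0) & (g_norm > 0),
        eta * w_norm / g_norm,          # layer-wise trust ratio
        torch.ones_like(w_norm),        # fall back to 1 for degenerate norms
    )
    param.data.add_(g, alpha=-(base_lr * trust).item())

def codistillation_loss(logits, targets, peer_logits, alpha=0.5):
    """Cross-entropy on the labels plus a distillation term that pulls this
    model's predictions towards a (possibly stale) peer model's predictions."""
    ce = F.cross_entropy(logits, targets)
    distill = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(peer_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return ce + alpha * distill
```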
Citations: 0
TopiQAL: Topic-aware Question Answering using Scalable Domain-specific Supercomputers
Pub Date : 2020-11-01 DOI: 10.1109/DLS51937.2020.00011
H. Venkataram, C. Mattmann, Scott Penberthy
We all have questions: about today’s temperature, the scores of our favorite baseball team, the Universe, and a vaccine for COVID-19. Life, physical, and natural scientists have been trying to find answers to various topics using scientific methods and experiments, while computer scientists have built language models as a tiny step towards automatically answering all of these questions across domains given a little bit of context. In this paper, we propose an architecture using state-of-the-art Natural Language Processing models, namely topic models and Bidirectional Encoder Representations from Transformers (BERT), that can transparently and automatically retrieve articles relevant to questions across domains and fetch answers to topical questions from current and historical COVID-19 medical research literature. We demonstrate the benefits of using domain-specific supercomputers such as Tensor Processing Units (TPUs) residing on cloud-based infrastructure, with which we achieve significant gains in training and inference times at very low cost.
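A minimal sketch of such a two-stage pipeline is shown below, assuming off-the-shelf components: an LDA topic model (scikit-learn) ranks documents by topical similarity to the question, then a BERT-style extractive reader (Hugging Face Transformers) pulls an answer span from the top-ranked documents. The library calls, model checkpoint, and parameters are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative topic-then-read pipeline; component choices are assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from transformers import pipeline

def build_topic_index(docs, n_topics=20):
    """Fit an LDA topic model over the document collection."""
    vec = CountVectorizer(stop_words="english")
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(vec.fit_transform(docs))   # (n_docs, n_topics)
    return vec, lda, doc_topics

def answer(question, docs, vec, lda, doc_topics, reader, top_k=3):
    """Rank documents by topic similarity to the question, then read answers."""
    q_topics = lda.transform(vec.transform([question]))       # (1, n_topics)
    scores = (doc_topics @ q_topics.T).ravel()                 # topical similarity
    best = np.argsort(scores)[::-1][:top_k]                    # top-k candidate articles
    candidates = [reader(question=question, context=docs[int(i)]) for i in best]
    return max(candidates, key=lambda a: a["score"])           # best-scoring span

# Usage (the reader checkpoint is an illustrative choice):
# reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
# vec, lda, doc_topics = build_topic_index(corpus)
# print(answer("What is the incubation period?", corpus, vec, lda, doc_topics, reader))
```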
Citations: 4
DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning
Pub Date : 2020-11-01 DOI: 10.1109/DLS51937.2020.00009
Matthijs Jansen, V. Codreanu, A. Varbanescu
Due to its many applications across various fields of research, engineering, and daily life, deep learning has seen a surge in popularity. Therefore, larger and more expressive models have been proposed, with examples like Turing-NLG using as many as 17 billion parameters. Training these very large models becomes increasingly difficult due to the high computational costs and large memory footprint. As a result, several approaches for distributed training based on data parallelism (e.g., Horovod) and model/pipeline parallelism (e.g., GPipe, PipeDream) have emerged. In this work, we focus on an in-depth comparison of three different parallelism models that address these needs: data, model, and pipeline parallelism. To this end, we provide an analytical comparison of the three, both in terms of computation time and memory usage, and introduce DDLBench, a comprehensive (open-source, ready-to-use) benchmark suite to quantify these differences in practice. Through in-depth performance analysis and experimentation with various models, datasets, distribution models, and hardware systems, we demonstrate that DDLBench can accurately quantify the capability of a given system to perform distributed deep learning (DDL). By comparing our analytical models with the benchmarking results, we show how the performance of real-life implementations diverges from these analytical models, thus requiring benchmarking to capture the in-depth complexity of the frameworks themselves. (DDLBench is available at https://github.com/sara-nl/DDLBench.)
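As a rough illustration of why the three schemes behave differently, the sketch below gives back-of-the-envelope per-device memory estimates for data, model, and pipeline parallelism. The closed forms are simplifying assumptions made for this example (gradients, optimizer state, and activation recomputation are ignored) and are not DDLBench's analytical models.

```python
# Back-of-the-envelope per-device memory estimates for the three parallelism
# schemes. The formulas are simplifying assumptions for illustration only.

def per_device_memory(scheme, n_params, act_per_sample, batch, n_dev,
                      bytes_per_val=4, micro_batches=8):
    """Very rough per-device memory in bytes: weights plus stored activations."""
    weights = n_params * bytes_per_val
    acts = act_per_sample * batch * bytes_per_val     # activations for the global batch
    if scheme == "data":
        # Full model replica on every device; each device holds 1/n of the batch.
        return weights + acts / n_dev
    if scheme == "model":
        # Layers sharded across devices; each shard stores only its own
        # layers' activations, but for the whole batch.
        return weights / n_dev + acts / n_dev
    if scheme == "pipeline":
        # Layer shards as above, but activations of several in-flight
        # micro-batches must be kept to keep the pipeline busy.
        in_flight = min(micro_batches, n_dev)
        return weights / n_dev + (acts / (n_dev * micro_batches)) * in_flight
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical example: a 1.5-billion-parameter model, batch size 256, 8 devices.
for s in ("data", "model", "pipeline"):
    print(s, round(per_device_memory(s, 1.5e9, 2e7, 256, 8) / 2**30, 2), "GiB")
```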
Citations: 6
[Copyright notice]
Pub Date : 2020-11-01 DOI: 10.1109/dls51937.2020.00002
Citations: 0
DeepGalaxy: Deducing the Properties of Galaxy Mergers from Images Using Deep Neural Networks
Pub Date : 2020-10-22 DOI: 10.1109/DLS51937.2020.00012
M. Cai, Jeroen B'edorf, V. Saletore, V. Codreanu, Damian Podareanu, Adel Chaibi, P. X. Qian
Galaxy mergers, the dynamical processes during which two galaxies collide, are among the most spectacular phenomena in the Universe. During this process, the two colliding galaxies are tidally disrupted, producing significant visual features that evolve as a function of time. These visual features contain valuable clues for deducing the physical properties of the galaxy mergers. In this work, we propose DeepGalaxy, a visual analysis framework trained to predict the physical properties of galaxy mergers based on their morphology. Built on an encoder-decoder architecture, DeepGalaxy encodes the input images into a compressed latent space z and determines the similarity of images by their latent-space distance. DeepGalaxy consists of a fully convolutional autoencoder (FCAE) that generates activation maps in its 3D latent space, a variational autoencoder (VAE) that compresses the activation maps into a 1D vector, and a classifier that generates labels from the activation maps. The backbone of the FCAE can be fully customized according to the complexity of the images. DeepGalaxy demonstrates excellent scaling performance on parallel machines. On the Endeavour supercomputer, the scaling efficiency exceeds 0.93 when trained on 128 workers and remains above 0.73 when trained with 512 workers. Without having to carry out expensive numerical simulations, DeepGalaxy infers the physical properties of galaxy mergers directly from images, thereby achieving a speedup factor of about $10^{5}$.
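The similarity-by-latent-distance idea can be illustrated with a toy convolutional encoder, sketched below in PyTorch; the layer counts, shapes, and distance metric are assumptions for illustration and are far smaller than the FCAE/VAE architecture used in DeepGalaxy.

```python
# Toy stand-in for latent-space similarity: encode images with a tiny
# convolutional encoder and compare them by distance in the latent space.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # collapse spatial dimensions
        )
        self.fc = nn.Linear(32, latent_dim)

    def forward(self, x):                            # x: (batch, 1, H, W) galaxy images
        z = self.conv(x).flatten(1)                  # (batch, 32) pooled features
        return self.fc(z)                            # (batch, latent_dim) latent codes

def latent_distance(encoder, img_a, img_b):
    """Smaller distance = more similar merger morphology (by assumption)."""
    with torch.no_grad():
        za, zb = encoder(img_a), encoder(img_b)
    return (za - zb).norm(dim=-1)
```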
Citations: 3
Towards a Scalable and Distributed Infrastructure for Deep Learning Applications
Pub Date : 2020-10-06 DOI: 10.1109/DLS51937.2020.00008
Bita Hasheminezhad, S. Shirzad, Nanmiao Wu, Patrick Diehl, Hannes Schulz, Hartmut Kaiser
Although recent scaling-up approaches to training deep neural networks have proven effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, requires deep learning frameworks to utilize scaling-out techniques. Parallelization approaches and distribution requirements are not considered in the primary designs of most available distributed deep learning frameworks, and most of them are still unable to perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx provides a productivity-oriented frontend where user Python code is translated into a futurized execution tree that can be executed efficiently on multiple nodes using HPX, the C++ standard library for parallelism and concurrency, leveraging fine-grained threading and an active-messaging, task-based runtime system.
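The core idea of a futurized execution tree can be illustrated conceptually: each node of an expression tree launches its children as futures, so independent sub-expressions evaluate concurrently. The Python sketch below uses a plain thread pool purely to convey this idea; it is not Phylanx's frontend or the HPX runtime.

```python
# Conceptual illustration only: a tiny "futurized" expression tree evaluated
# on a thread pool, so independent sub-expressions can run concurrently.
import operator
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

class Node:
    """An operator applied to child nodes (plain Python values act as leaves)."""
    def __init__(self, op, *children):
        self.op, self.children = op, children

    def evaluate(self):
        # Launch every child node as a future, then combine the results.
        futures = [pool.submit(c.evaluate) if isinstance(c, Node) else None
                   for c in self.children]
        args = [f.result() if f is not None else c
                for f, c in zip(futures, self.children)]
        return self.op(*args)

# (2 + 3) * (4 + 5): the two additions are independent and may run concurrently.
tree = Node(operator.mul,
            Node(operator.add, 2, 3),
            Node(operator.add, 4, 5))
print(tree.evaluate())   # 18
```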
Citations: 3
Time-Based Roofline for Deep Learning Performance Analysis
Pub Date : 2020-09-09 DOI: 10.1109/DLS51937.2020.00007
Yunsong Wang, Charlene Yang, S. Farrell, Yan Zhang, T. Kurth, Samuel Williams
Deep learning applications based on neural networks are generating considerable interest in various fields due to their high accuracy. Such applications are usually very compute-intensive and thus require long run times. Researchers and engineers are actively exploring new solutions to this issue from both the hardware and the software/algorithm side. However, little previous work has focused on providing a practical methodology to characterize deep learning performance bottlenecks and guide subsequent optimization efforts. In this paper, we introduce an extension of the Roofline model and use it to analyze two representative computation kernels in deep learning, 2D convolution and long short-term memory, on NVIDIA GPUs. This new time-based Roofline model incorporates both compute/bandwidth complexity and run time in its formulae to demonstrate performance issues that cannot be reflected by the classic Roofline. Factors such as arithmetic intensity, data transfer, kernel launch overhead, and Tensor Core usage are examined by varying parameters such as batch size and feature size. This work helps form a more systematic way to understand the performance issues of deep learning applications. Last but not least, this generic performance model can be applied to a wide range of applications beyond deep learning.
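For context, the classic Roofline bounds attainable throughput by $\min(\text{peak FLOP/s}, \; \mathrm{AI} \times \text{peak bandwidth})$, where AI is the arithmetic intensity (FLOPs per byte moved). The sketch below recasts this as a lower bound on run time plus a fixed launch overhead; it is a simplified reading of the time-based idea, and the paper's exact formulation, peak values, and measured overheads may differ.

```python
# Simplified sketch of classic vs. time-based Roofline estimates for one
# kernel. The peak numbers and workload sizes are illustrative assumptions.

def classic_roofline_flops(ai, peak_flops, peak_bw):
    """Attainable FLOP/s given arithmetic intensity ai = FLOPs / bytes."""
    return min(peak_flops, ai * peak_bw)

def time_based_roofline(flops, bytes_moved, peak_flops, peak_bw,
                        launch_overhead=0.0):
    """Lower bound on kernel run time: the kernel can finish no sooner than
    the slower of its compute time and its data-movement time, plus any
    fixed launch overhead."""
    t_compute = flops / peak_flops
    t_memory = bytes_moved / peak_bw
    return max(t_compute, t_memory) + launch_overhead

# Hypothetical convolution-sized workload on an assumed GPU:
flops, bytes_moved = 2.0e12, 4.0e10          # assumed kernel workload
peak_flops, peak_bw = 1.2e14, 9.0e11         # e.g. ~120 TFLOP/s peak, 900 GB/s
print(time_based_roofline(flops, bytes_moved, peak_flops, peak_bw,
                          launch_overhead=5e-6))   # seconds
```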
Citations: 13
Message from the Workshop Chairs
Pub Date : 2018-10-01 DOI: 10.1109/PERCOMW.2006.85
E. Biersack, P. Rodriguez
science. Other papers describe algorithms and systems for large-scale training on supercomputers, and approaches to performance benchmarking on the most powerful HPC systems. We thank both the authors and reviewers for their contributions to the workshop.
Citations: 0