Distributed Deep Learning for Precipitation Nowcasting
S. Samsi, Christopher J. Mattioli, M. Veillette
Pub Date: 2019-08-28 | DOI: 10.1109/HPEC.2019.8916416
Effective training of Deep Neural Networks requires massive amounts of data and compute. As a result, complex models trained on large datasets take a long time to train, which can severely limit research on model development and the exploitation of all available data. In this paper, this problem is investigated in the context of precipitation nowcasting, a term used to describe highly detailed short-term forecasts of precipitation and other hazardous weather. Convolutional Neural Networks (CNNs) are a powerful class of models that are well-suited for this task; however, the high-resolution input weather imagery, combined with the model complexity required to process it, makes training CNNs for this task time-consuming. To address this issue, a data-parallel approach is implemented in which a CNN is replicated across multiple compute nodes and the training batches are distributed among them. By leveraging multiple GPUs, we show that the training time for a given nowcasting model architecture can be reduced from 59 hours to just over 1 hour. This will allow faster iteration when improving CNN architectures and will facilitate future advances in nowcasting.
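The abstract does not name the training framework. Below is a minimal sketch of the data-parallel pattern it describes, with the model replicated per GPU and each batch sharded across workers, using PyTorch DistributedDataParallel; the model, dataset, and hyperparameters are placeholders rather than the paper's nowcasting architecture.

```python
# Sketch of data-parallel training; launch with: torchrun --nproc_per_node=<gpus> script.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def train():
    dist.init_process_group("nccl")                # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Placeholder stand-in for a nowcasting CNN over radar imagery.
    model = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, 3, padding=1),
    ).cuda()
    model = DDP(model)                             # replicate across workers

    # Synthetic data; DistributedSampler shards the batches across workers.
    data = TensorDataset(torch.randn(256, 1, 64, 64), torch.randn(256, 1, 64, 64))
    loader = DataLoader(data, batch_size=8, sampler=DistributedSampler(data))

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()                            # gradients all-reduced by DDP
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```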

Securing HPC using Federated Authentication
Andrew Prout, W. Arcand, David Bestor, Bill Bergeron, C. Byun, V. Gadepally, Michael Houle, M. Hubbell, Michael Jones, Anna Klein, P. Michaleas, Lauren Milechin, J. Mullen, Antonio Rosa, S. Samsi, Charles Yee, A. Reuther, J. Kepner
Pub Date: 2019-08-20 | DOI: 10.1109/HPEC.2019.8916255
Federated authentication can drastically reduce the overhead of basic account maintenance while simultaneously improving overall system security. Integrating with the user's more frequently used account at their primary organization both provides a better experience to the end user and makes account compromise or changes in affiliation more likely to be noticed and acted upon. Additionally, with many organizations transitioning to multi-factor authentication for all account access, the ability to leverage external federated identity management systems provides the benefit of those efforts without the additional overhead of separately implementing a distinct multi-factor authentication process. This paper describes our experiences and the lessons we learned by enabling federated authentication with the U.S. Government PKI and the InCommon Federation, scaling it up to the user base of a production HPC system, and the motivations behind those choices. We have received only positive feedback from our users.

Large Scale Organization and Inference of an Imagery Dataset for Public Safety
Jeffrey Liu, David Strohschein, S. Samsi, A. Weinert
Pub Date: 2019-08-16 | DOI: 10.1109/HPEC.2019.8916437
Video applications and analytics are routinely projected as a stressing and significant service of the Nationwide Public Safety Broadband Network. As part of a NIST PSCR funded effort, the New Jersey Office of Homeland Security and Preparedness and MIT Lincoln Laboratory have been developing a computer vision dataset of operational and representative public safety scenarios. The scale and scope of this dataset necessitate a hierarchical organization for efficient compute and storage. We give an overview of architectural considerations, using the Lincoln Laboratory Supercomputing Cluster (LLSC) as a test architecture. We then describe how we organized the dataset across the LLSC and evaluated it with large-scale imagery inference across terabytes of data.
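The abstract does not spell out the on-disk layout. The sketch below shows one plausible hierarchical organization (scenario/date/camera) and how batches of image paths could be streamed to inference workers; all paths, the directory scheme, and the model handle are hypothetical.

```python
# Illustrative only: a bounded-fanout hierarchy keeps directory sizes small,
# which matters for file-system metadata performance at terabyte scale.
from pathlib import Path
from itertools import islice

def iter_batches(root: Path, batch_size: int = 64):
    """Yield fixed-size batches of image paths from a hierarchical tree."""
    images = root.glob("*/*/*/*.jpg")      # scenario/date/camera/frame.jpg
    while True:
        batch = list(islice(images, batch_size))
        if not batch:
            return
        yield batch

# Each batch would be handed to one worker/GPU for inference, e.g.:
# for batch in iter_batches(Path("/data/publicsafety")):
#     predictions = model.predict(batch)   # hypothetical model handle
```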

Optimizing Xeon Phi for Interactive Data Analysis
C. Byun, J. Kepner, W. Arcand, David Bestor, William Bergeron, M. Hubbell, V. Gadepally, Michael Houle, Michael Jones, Anna Klein, Lauren Milechin, P. Michaleas, J. Mullen, Andrew Prout, Antonio Rosa, S. Samsi, Charles Yee, A. Reuther
Pub Date: 2019-07-06 | DOI: 10.1109/HPEC.2019.8916300
The Intel Xeon Phi manycore processor is designed to provide high-performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving optimal performance of matrix operations within these environments requires tuning the Xeon Phi OpenMP settings, process pinning, and memory modes. This paper describes matrix multiplication performance results for Matlab and GNU Octave over a variety of combinations of process counts, OpenMP threads, and Xeon Phi memory modes. These results indicate that using KMP_AFFINITY=granularity=fine, taskset pinning, and all2all cache memory mode allows both Matlab and GNU Octave to achieve 66% of the practical peak performance for process counts ranging from 1 to 64 and OpenMP threads ranging from 1 to 64. These settings have resulted in generally improved performance across a range of applications and have enabled our Xeon Phi system to deliver significant results in a number of real-world applications.
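As a concrete illustration of the settings reported above, here is a hedged sketch of a launcher applying them to a benchmark run; the benchmark script, thread count, and core range are placeholders, and the cache/all2all memory mode is configured at boot rather than per process.

```python
# Sketch of launching an Octave matrix benchmark with the reported settings.
import os
import subprocess

env = dict(os.environ,
           KMP_AFFINITY="granularity=fine",  # bind each OpenMP thread to a core
           OMP_NUM_THREADS="64")             # the study swept 1..64 threads

# Pin the process to an explicit core range with taskset, as in the paper.
cmd = ["taskset", "-c", "0-63", "octave", "--eval", "run('matmul_bench.m')"]
subprocess.run(cmd, env=env, check=True)
```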

Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
J. Kepner, V. Gadepally, Lauren Milechin, S. Samsi, W. Arcand, David Bestor, William Bergeron, C. Byun, M. Hubbell, Michael Houle, Michael Jones, Anna Klein, P. Michaleas, J. Mullen, Andrew Prout, Antonio Rosa, Charles Yee, A. Reuther
Pub Date: 2019-07-06 | DOI: 10.1109/HPEC.2019.8916508
The Dynamic Distributed Dimensional Data Model (D4M) library implements associative arrays in a variety of languages (Python, Julia, and Matlab/Octave) and provides a lightweight in-memory database implementation of hypersparse arrays that are ideal for analyzing many types of network data. D4M relies on associative arrays, which combine properties of spreadsheets, databases, matrices, graphs, and networks, while providing rigorous mathematical guarantees such as linearity. Streaming updates of D4M associative arrays put enormous pressure on the memory hierarchy. This work describes the design and performance optimization of an implementation of hierarchical associative arrays that reduces memory pressure and dramatically increases the update rate into an associative array. The key parameters of hierarchical associative arrays control the number of entries allowed in each level of the hierarchy before an update is cascaded to the next level. These parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical arrays achieve over 40,000 updates per second in a single instance. Scaling to 34,000 instances of hierarchical D4M associative arrays on 1,100 server nodes of the MIT SuperCloud achieved a sustained update rate of 1,900,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.
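A toy sketch of the cascading mechanism the abstract describes, not the actual D4M implementation: each level buffers updates until its entry count exceeds a cutoff, at which point its contents are folded into the next, larger level. The cutoff values and the integer-sum combiner are illustrative.

```python
class HierarchicalArray:
    """Toy hierarchical associative array with cascading levels."""

    def __init__(self, cutoffs=(1_000, 100_000, 10_000_000)):
        self.cutoffs = cutoffs
        self.levels = [{} for _ in cutoffs]   # small, fast levels first

    def update(self, key, value):
        top = self.levels[0]
        top[key] = top.get(key, 0) + value    # associative (+) combine
        for i, cutoff in enumerate(self.cutoffs[:-1]):
            if len(self.levels[i]) <= cutoff:
                break
            # Cascade: fold this level into the next, then clear it.
            nxt = self.levels[i + 1]
            for k, v in self.levels[i].items():
                nxt[k] = nxt.get(k, 0) + v
            self.levels[i].clear()

arr = HierarchicalArray()
arr.update(("src1", "dst7"), 1)   # e.g., count one network event
```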

Spaceland Embedding of Sparse Stochastic Graphs
N. Pitsianis, A. Iliopoulos, D. Floros, Xiaobai Sun
Pub Date: 2019-06-13 | DOI: 10.1109/HPEC.2019.8916505
We introduce SG-t-SNE, a nonlinear method for embedding stochastic graphs/networks into d-dimensional spaces, d = 1, 2, 3, without requiring vertex features to reside in, or be transformed into, a metric space. Graphs/networks are relational data, prevalent in real-world applications. Graph embedding is fundamental to many graph analysis tasks, beyond graph visualization. SG-t-SNE follows and builds upon the core principle of t-SNE, a widely used method for visualizing high-dimensional data. We also introduce SG-t-SNE-Π, high-performance software for rapid d-dimensional embedding of large, sparse, stochastic graphs on personal computers with superior efficiency. It empowers SG-t-SNE with modern computing techniques that exploit matrix structures in tandem with memory architectures. We present illustrative graph embedding results on several synthetic graphs and real-world networks in this paper and its Supplementary Material, available at http://t-sne-pi.cs.duke.edu.
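The "stochastic graph" input is a sparse matrix rescaled so that each column sums to 1. As a hedged illustration of that preprocessing step (the embedding itself is performed by the SG-t-SNE-Π software at the URL above), the sketch below builds a column-stochastic matrix from a weighted adjacency matrix with SciPy.

```python
import numpy as np
from scipy import sparse

def to_column_stochastic(adj):
    """Rescale a sparse adjacency matrix so each nonzero column sums to 1."""
    adj = adj.tocsc().astype(float)
    col_sums = np.asarray(adj.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0          # leave empty columns as zeros
    return adj @ sparse.diags(1.0 / col_sums)

A = sparse.random(1000, 1000, density=0.01, format="csc")  # toy weighted graph
P = to_column_stochastic(A)                # input suitable for stochastic embedding
```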

Deploying AI Frameworks on Secure HPC Systems with Containers
D. Brayford, S. Vallecorsa, Atanas Z. Atanasov, F. Baruffa, Walter Riviera
Pub Date: 2019-05-24 | DOI: 10.1109/HPEC.2019.8916576
The increasing interest from the research community and industry in using Artificial Intelligence (AI) techniques to tackle "real world" problems requires High Performance Computing (HPC) resources to efficiently compute and scale complex algorithms across thousands of nodes. Unfortunately, typical data scientists are not familiar with the unique requirements and characteristics of HPC environments. They usually develop their applications with high-level scripting languages or frameworks such as TensorFlow, and the installation processes often require connecting to external systems to download open source software during the build. HPC environments, on the other hand, are often based on closed source applications that incorporate parallel and distributed computing APIs such as MPI and OpenMP, while users have restricted administrator privileges and face security restrictions such as no access to external systems. In this paper we discuss the issues associated with deploying AI frameworks in a secure HPC environment and how we successfully deployed AI frameworks on SuperMUC-NG with Charliecloud.
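A hedged sketch of the deployment pattern described: the container image is built and flattened to a tarball on a network-connected machine, copied to the secure system, then unpacked and run without privileges. The image name and paths are placeholders, and Charliecloud command names and flags vary across versions.

```python
# Deploy-side steps on the air-gapped HPC system (no daemon, no root).
import subprocess

IMAGE_TAR = "/lustre/images/tensorflow.tar.gz"   # built elsewhere, copied in
UNPACK_DIR = "/var/tmp"                          # node-local scratch

# Unpack the flattened image directory.
subprocess.run(["ch-tar2dir", IMAGE_TAR, UNPACK_DIR], check=True)

# Run the containerized framework, bind-mounting the parallel file system.
subprocess.run(
    ["ch-run", "-b", "/lustre:/mnt", f"{UNPACK_DIR}/tensorflow",
     "--", "python3", "/mnt/scripts/train.py"],
    check=True,
)
```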

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs
Yehia Arafa, Abdel-Hameed A. Badawy, Gopinath Chennupati, N. Santhi, S. Eidenbenz
Pub Date: 2019-05-21 | DOI: 10.1109/HPEC.2019.8916466
The last decade has seen a shift in the computer systems industry in which heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in everything from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPU) to boost the performance of compute-intensive applications. However, many micro-architectural characteristics beyond what vendors disclose remain unknown. In this paper, we introduce a very low-overhead, portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the micro-architecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform on these latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects accurately characterize the latencies of these GPUs, which will aid in modeling the hardware. Also, software developers can perform informed optimizations of their applications.
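The abstract does not give the measurement harness. A standard technique, sketched below with PyCUDA, times a long dependent instruction chain with clock64() in a single-thread kernel and divides the elapsed cycles by the chain length; the instruction mix and chain length here are illustrative.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void lat_fadd(unsigned long long *cycles, float *sink, float a, float b) {
    float x = a;
    unsigned long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = x + b;                       // dependent chain of 256 FADDs
    unsigned long long stop = clock64();
    *cycles = stop - start;
    *sink = x;                           // prevent dead-code elimination
}
""")
lat_fadd = mod.get_function("lat_fadd")

cycles = np.zeros(1, dtype=np.uint64)
sink = np.zeros(1, dtype=np.float32)
lat_fadd(drv.Out(cycles), drv.Out(sink), np.float32(1.0), np.float32(2.0),
         block=(1, 1, 1), grid=(1, 1))
print(f"~{cycles[0] / 256:.2f} cycles per dependent FADD")
```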

Multistart Methods for Quantum Approximate Optimization
Ruslan Shaydulin, Ilya Safro, Jeffrey Larson
Pub Date: 2019-05-21 | DOI: 10.1109/HPEC.2019.8916288
Hybrid quantum-classical algorithms such as the quantum approximate optimization algorithm (QAOA) are considered one of the most promising approaches for leveraging near-term quantum computers for practical applications. Such algorithms are often implemented in a variational form, combining classical optimization methods with a quantum machine to find parameters that maximize performance. The quality of the QAOA solution depends heavily on the quality of the parameters produced by the classical optimizer. Moreover, the presence of multiple local optima makes it difficult for the classical optimizer to identify high-quality parameters. In this paper we study the use of a multistart optimization approach within QAOA to improve the performance of quantum machines on important graph clustering problems. We also demonstrate that reusing the optimal parameters from similar problems can improve the performance of classical optimization methods, expanding on similar results for MAXCUT.
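A generic multistart loop of the kind studied in the paper: run a local optimizer from many random initial points and keep the best result. The objective below is a toy stand-in; in QAOA it would be the expected cost of the parameterized circuit at angles (gamma, beta), evaluated on a simulator or quantum device.

```python
import numpy as np
from scipy.optimize import minimize

def qaoa_energy(theta):
    """Placeholder objective with multiple local optima (p=1 QAOA shape)."""
    gamma, beta = theta
    return -np.sin(2 * beta) * np.sin(gamma) ** 2

rng = np.random.default_rng(0)
best = None
for _ in range(20):                              # 20 random starts
    theta0 = rng.uniform(0, np.pi, size=2)
    res = minimize(qaoa_energy, theta0, method="COBYLA")
    if best is None or res.fun < best.fun:
        best = res                               # keep the best local optimum

print("best angles:", best.x, "energy:", best.fun)
```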

Overcoming Limitations of GPGPU-Computing in Scientific Applications
Connor Kenyon, Glenn Volkema, G. Khanna
Pub Date: 2019-05-10 | DOI: 10.1109/HPEC.2019.8916330
The performance of discrete general-purpose graphics processing units (GPGPUs) has been improving at a rapid pace. The PCIe interconnect that carries data between the system host memory and the GPU has not improved as quickly, leaving a performance gap due to GPU idle time while waiting for PCIe data transfers. In this article, we explore two alternatives to the limited PCIe bandwidth: the NVIDIA NVLink interconnect, and zero-copy algorithms for shared-memory Heterogeneous System Architecture (HSA) devices. The OpenCL SHOC benchmark suite is used to measure the performance of each device on various scientific application kernels.
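As a small companion to the benchmark discussion, the sketch below measures host-to-device bandwidth with CUDA events in PyCUDA, the transfer the article identifies as the bottleneck; the buffer size is arbitrary, and the paper's SHOC-based methodology is broader.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

N = 256 * 1024 * 1024                            # 256 MiB payload
host = drv.pagelocked_empty(N, dtype=np.uint8)   # pinned host buffer
dev = drv.mem_alloc(N)

start, stop = drv.Event(), drv.Event()
start.record()
drv.memcpy_htod(dev, host)                       # the transfer under test
stop.record()
stop.synchronize()

ms = start.time_till(stop)                       # elapsed time in milliseconds
print(f"H2D bandwidth: {N / (ms * 1e-3) / 1e9:.2f} GB/s")
```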