Distributed Deep Learning for Precipitation Nowcasting
S. Samsi, Christopher J. Mattioli, M. Veillette
Pub Date: 2019-08-28 | DOI: 10.1109/HPEC.2019.8916416
Effective training of Deep Neural Networks requires massive amounts of data and compute. As a result, complex models trained on large datasets take a long time to train, which can severely limit research on model development and the exploitation of all available data. In this paper, this problem is investigated in the context of precipitation nowcasting, a term used to describe highly detailed short-term forecasts of precipitation and other hazardous weather. Convolutional Neural Networks (CNNs) are a powerful class of models that are well-suited for this task; however, the high-resolution input weather imagery, combined with the model complexity required to process it, makes training CNNs for this task time-consuming. To address this issue, a data-parallel approach is implemented in which a CNN is replicated across multiple compute nodes and the training batches are distributed among them. By leveraging multiple GPUs, we show that the training time for a given nowcasting model architecture can be reduced from 59 hours to just over 1 hour. This will allow faster iteration when improving CNN architectures and will facilitate future advances in nowcasting.
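The abstract does not name the training framework. Below is a minimal sketch of the data-parallel pattern it describes, with the model replicated per GPU and each batch sharded across workers, using PyTorch DistributedDataParallel; the model, dataset, and hyperparameters are placeholders rather than the paper's nowcasting architecture.

```python
# Sketch of data-parallel training; launch with: torchrun --nproc_per_node=<gpus> script.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def train():
    dist.init_process_group("nccl")                # one process per GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Placeholder stand-in for a nowcasting CNN over radar imagery.
    model = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, 3, padding=1),
    ).cuda()
    model = DDP(model)                             # replicate across workers

    # Synthetic data; DistributedSampler shards the batches across workers.
    data = TensorDataset(torch.randn(256, 1, 64, 64), torch.randn(256, 1, 64, 64))
    loader = DataLoader(data, batch_size=8, sampler=DistributedSampler(data))

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()                            # gradients all-reduced by DDP
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```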

Securing HPC using Federated Authentication
Andrew Prout, W. Arcand, David Bestor, Bill Bergeron, C. Byun, V. Gadepally, Michael Houle, M. Hubbell, Michael Jones, Anna Klein, P. Michaleas, Lauren Milechin, J. Mullen, Antonio Rosa, S. Samsi, Charles Yee, A. Reuther, J. Kepner
Pub Date: 2019-08-20 | DOI: 10.1109/HPEC.2019.8916255
Federated authentication can drastically reduce the overhead of basic account maintenance while simultaneously improving overall system security. Integrating with the user's more frequently used account at their primary organization both provides a better experience to the end user and makes account compromise or changes in affiliation more likely to be noticed and acted upon. Additionally, with many organizations transitioning to multi-factor authentication for all account access, the ability to leverage external federated identity management systems provides the benefit of those efforts without the additional overhead of separately implementing a distinct multi-factor authentication process. This paper describes our experiences and the lessons we learned by enabling federated authentication with the U.S. Government PKI and the InCommon Federation, scaling it up to the user base of a production HPC system, and the motivations behind those choices. We have received only positive feedback from our users.

Large Scale Organization and Inference of an Imagery Dataset for Public Safety
Jeffrey Liu, David Strohschein, S. Samsi, A. Weinert
Pub Date: 2019-08-16 | DOI: 10.1109/HPEC.2019.8916437
Video applications and analytics are routinely projected as a stressing and significant service of the Nationwide Public Safety Broadband Network. As part of a NIST PSCR funded effort, the New Jersey Office of Homeland Security and Preparedness and MIT Lincoln Laboratory have been developing a computer vision dataset of operational and representative public safety scenarios. The scale and scope of this dataset necessitate a hierarchical organization for efficient compute and storage. We give an overview of architectural considerations, using the Lincoln Laboratory Supercomputing Cluster (LLSC) as a test architecture. We then describe how we organized the dataset across the LLSC and evaluated it with large-scale imagery inference across terabytes of data.
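The abstract does not spell out the on-disk layout. The sketch below shows one plausible hierarchical organization (scenario/date/camera) and how batches of image paths could be streamed to inference workers; all paths, the directory scheme, and the model handle are hypothetical.

```python
# Illustrative only: a bounded-fanout hierarchy keeps directory sizes small,
# which matters for file-system metadata performance at terabyte scale.
from pathlib import Path
from itertools import islice

def iter_batches(root: Path, batch_size: int = 64):
    """Yield fixed-size batches of image paths from a hierarchical tree."""
    images = root.glob("*/*/*/*.jpg")      # scenario/date/camera/frame.jpg
    while True:
        batch = list(islice(images, batch_size))
        if not batch:
            return
        yield batch

# Each batch would be handed to one worker/GPU for inference, e.g.:
# for batch in iter_batches(Path("/data/publicsafety")):
#     predictions = model.predict(batch)   # hypothetical model handle
```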

Optimizing Xeon Phi for Interactive Data Analysis
C. Byun, J. Kepner, W. Arcand, David Bestor, William Bergeron, M. Hubbell, V. Gadepally, Michael Houle, Michael Jones, Anna Klein, Lauren Milechin, P. Michaleas, J. Mullen, Andrew Prout, Antonio Rosa, S. Samsi, Charles Yee, A. Reuther
Pub Date: 2019-07-06 | DOI: 10.1109/HPEC.2019.8916300
The Intel Xeon Phi manycore processor is designed to provide high-performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving optimal performance of matrix operations within these environments requires tuning the Xeon Phi OpenMP settings, process pinning, and memory modes. This paper describes matrix multiplication performance results for Matlab and GNU Octave over a variety of combinations of process counts, OpenMP threads, and Xeon Phi memory modes. These results indicate that using KMP_AFFINITY=granularity=fine, taskset pinning, and all2all cache memory mode allows both Matlab and GNU Octave to achieve 66% of the practical peak performance for process counts ranging from 1 to 64 and OpenMP threads ranging from 1 to 64. These settings have resulted in generally improved performance across a range of applications and have enabled our Xeon Phi system to deliver significant results in a number of real-world applications.
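As a concrete illustration of the settings reported above, here is a hedged sketch of a launcher applying them to a benchmark run; the benchmark script, thread count, and core range are placeholders, and the cache/all2all memory mode is configured at boot rather than per process.

```python
# Sketch of launching an Octave matrix benchmark with the reported settings.
import os
import subprocess

env = dict(os.environ,
           KMP_AFFINITY="granularity=fine",  # bind each OpenMP thread to a core
           OMP_NUM_THREADS="64")             # the study swept 1..64 threads

# Pin the process to an explicit core range with taskset, as in the paper.
cmd = ["taskset", "-c", "0-63", "octave", "--eval", "run('matmul_bench.m')"]
subprocess.run(cmd, env=env, check=True)
```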

Streaming 1.9 Billion Hypersparse Network Updates per Second with D4M
J. Kepner, V. Gadepally, Lauren Milechin, S. Samsi, W. Arcand, David Bestor, William Bergeron, C. Byun, M. Hubbell, Michael Houle, Michael Jones, Anna Klein, P. Michaleas, J. Mullen, Andrew Prout, Antonio Rosa, Charles Yee, A. Reuther
Pub Date: 2019-07-06 | DOI: 10.1109/HPEC.2019.8916508
The Dynamic Distributed Dimensional Data Model (D4M) library implements associative arrays in a variety of languages (Python, Julia, and Matlab/Octave) and provides a lightweight in-memory database implementation of hypersparse arrays that are ideal for analyzing many types of network data. D4M relies on associative arrays, which combine properties of spreadsheets, databases, matrices, graphs, and networks, while providing rigorous mathematical guarantees such as linearity. Streaming updates of D4M associative arrays put enormous pressure on the memory hierarchy. This work describes the design and performance optimization of an implementation of hierarchical associative arrays that reduces memory pressure and dramatically increases the update rate into an associative array. The key parameters of hierarchical associative arrays control the number of entries allowed in each level of the hierarchy before an update is cascaded to the next level. These parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical arrays achieve over 40,000 updates per second in a single instance. Scaling to 34,000 instances of hierarchical D4M associative arrays on 1,100 server nodes of the MIT SuperCloud achieved a sustained update rate of 1,900,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.
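A toy sketch of the cascading mechanism the abstract describes, not the actual D4M implementation: each level buffers updates until its entry count exceeds a cutoff, at which point its contents are folded into the next, larger level. The cutoff values and the integer-sum combiner are illustrative.

```python
class HierarchicalArray:
    """Toy hierarchical associative array with cascading levels."""

    def __init__(self, cutoffs=(1_000, 100_000, 10_000_000)):
        self.cutoffs = cutoffs
        self.levels = [{} for _ in cutoffs]   # small, fast levels first

    def update(self, key, value):
        top = self.levels[0]
        top[key] = top.get(key, 0) + value    # associative (+) combine
        for i, cutoff in enumerate(self.cutoffs[:-1]):
            if len(self.levels[i]) <= cutoff:
                break
            # Cascade: fold this level into the next, then clear it.
            nxt = self.levels[i + 1]
            for k, v in self.levels[i].items():
                nxt[k] = nxt.get(k, 0) + v
            self.levels[i].clear()

arr = HierarchicalArray()
arr.update(("src1", "dst7"), 1)   # e.g., count one network event
```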

Spaceland Embedding of Sparse Stochastic Graphs
N. Pitsianis, A. Iliopoulos, D. Floros, Xiaobai Sun
Pub Date: 2019-06-13 | DOI: 10.1109/HPEC.2019.8916505
We introduce SG-t-SNE, a nonlinear method for embedding stochastic graphs/networks into d-dimensional spaces, d = 1, 2, 3, without requiring vertex features to reside in, or be transformed into, a metric space. Graphs/networks are relational data, prevalent in real-world applications. Graph embedding is fundamental to many graph analysis tasks, beyond graph visualization. SG-t-SNE follows and builds upon the core principle of t-SNE, a widely used method for visualizing high-dimensional data. We also introduce SG-t-SNE-Π, high-performance software for rapid d-dimensional embedding of large, sparse, stochastic graphs on personal computers with superior efficiency. It empowers SG-t-SNE with modern computing techniques that exploit matrix structures in tandem with memory architectures. We present illustrative graph embedding results on several synthetic graphs and real-world networks in this paper and its Supplementary Material, available at http://t-sne-pi.cs.duke.edu.
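The "stochastic graph" input is a sparse matrix rescaled so that each column sums to 1. As a hedged illustration of that preprocessing step (the embedding itself is performed by the SG-t-SNE-Π software at the URL above), the sketch below builds a column-stochastic matrix from a weighted adjacency matrix with SciPy.

```python
import numpy as np
from scipy import sparse

def to_column_stochastic(adj):
    """Rescale a sparse adjacency matrix so each nonzero column sums to 1."""
    adj = adj.tocsc().astype(float)
    col_sums = np.asarray(adj.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0          # leave empty columns as zeros
    return adj @ sparse.diags(1.0 / col_sums)

A = sparse.random(1000, 1000, density=0.01, format="csc")  # toy weighted graph
P = to_column_stochastic(A)                # input suitable for stochastic embedding
```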

Deploying AI Frameworks on Secure HPC Systems with Containers
D. Brayford, S. Vallecorsa, Atanas Z. Atanasov, F. Baruffa, Walter Riviera
Pub Date: 2019-05-24 | DOI: 10.1109/HPEC.2019.8916576
The increasing interest from the research community and industry in using Artificial Intelligence (AI) techniques to tackle "real world" problems requires High Performance Computing (HPC) resources to efficiently compute and scale complex algorithms across thousands of nodes. Unfortunately, typical data scientists are not familiar with the unique requirements and characteristics of HPC environments. They usually develop their applications with high-level scripting languages or frameworks such as TensorFlow, and the installation processes often require connecting to external systems to download open source software during the build. HPC environments, on the other hand, are often based on closed source applications that incorporate parallel and distributed computing APIs such as MPI and OpenMP, while users have restricted administrator privileges and face security restrictions such as no access to external systems. In this paper we discuss the issues associated with deploying AI frameworks in a secure HPC environment and how we successfully deployed AI frameworks on SuperMUC-NG with Charliecloud.
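A hedged sketch of the deployment pattern described: the container image is built and flattened to a tarball on a network-connected machine, copied to the secure system, then unpacked and run without privileges. The image name and paths are placeholders, and Charliecloud command names and flags vary across versions.

```python
# Deploy-side steps on the air-gapped HPC system (no daemon, no root).
import subprocess

IMAGE_TAR = "/lustre/images/tensorflow.tar.gz"   # built elsewhere, copied in
UNPACK_DIR = "/var/tmp"                          # node-local scratch

# Unpack the flattened image directory.
subprocess.run(["ch-tar2dir", IMAGE_TAR, UNPACK_DIR], check=True)

# Run the containerized framework, bind-mounting the parallel file system.
subprocess.run(
    ["ch-run", "-b", "/lustre:/mnt", f"{UNPACK_DIR}/tensorflow",
     "--", "python3", "/mnt/scripts/train.py"],
    check=True,
)
```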

Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs
Yehia Arafa, Abdel-Hameed A. Badawy, Gopinath Chennupati, N. Santhi, S. Eidenbenz
Pub Date: 2019-05-21 | DOI: 10.1109/HPEC.2019.8916466
The last decade has seen a shift in the computer systems industry in which heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in everything from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPU) to boost the performance of compute-intensive applications. However, many micro-architectural characteristics beyond what vendors disclose remain unknown. In this paper, we introduce a very low-overhead, portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the micro-architecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform on these latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects accurately characterize the latencies of these GPUs, which will aid in modeling the hardware. Also, software developers can perform informed optimizations of their applications.
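The abstract does not give the measurement harness. A standard technique, sketched below with PyCUDA, times a long dependent instruction chain with clock64() in a single-thread kernel and divides the elapsed cycles by the chain length; the instruction mix and chain length here are illustrative.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void lat_fadd(unsigned long long *cycles, float *sink, float a, float b) {
    float x = a;
    unsigned long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = x + b;                       // dependent chain of 256 FADDs
    unsigned long long stop = clock64();
    *cycles = stop - start;
    *sink = x;                           // prevent dead-code elimination
}
""")
lat_fadd = mod.get_function("lat_fadd")

cycles = np.zeros(1, dtype=np.uint64)
sink = np.zeros(1, dtype=np.float32)
lat_fadd(drv.Out(cycles), drv.Out(sink), np.float32(1.0), np.float32(2.0),
         block=(1, 1, 1), grid=(1, 1))
print(f"~{cycles[0] / 256:.2f} cycles per dependent FADD")
```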

Multistart Methods for Quantum Approximate Optimization
Ruslan Shaydulin, Ilya Safro, Jeffrey Larson
Pub Date: 2019-05-21 | DOI: 10.1109/HPEC.2019.8916288
Hybrid quantum-classical algorithms such as the quantum approximate optimization algorithm (QAOA) are considered one of the most promising approaches for leveraging near-term quantum computers for practical applications. Such algorithms are often implemented in a variational form, combining classical optimization methods with a quantum machine to find parameters that maximize performance. The quality of the QAOA solution depends heavily on the quality of the parameters produced by the classical optimizer. Moreover, the presence of multiple local optima makes it difficult for the classical optimizer to identify high-quality parameters. In this paper we study the use of a multistart optimization approach within QAOA to improve the performance of quantum machines on important graph clustering problems. We also demonstrate that reusing the optimal parameters from similar problems can improve the performance of classical optimization methods, expanding on similar results for MAXCUT.
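A generic multistart loop of the kind studied in the paper: run a local optimizer from many random initial points and keep the best result. The objective below is a toy stand-in; in QAOA it would be the expected cost of the parameterized circuit at angles (gamma, beta), evaluated on a simulator or quantum device.

```python
import numpy as np
from scipy.optimize import minimize

def qaoa_energy(theta):
    """Placeholder objective with multiple local optima (p=1 QAOA shape)."""
    gamma, beta = theta
    return -np.sin(2 * beta) * np.sin(gamma) ** 2

rng = np.random.default_rng(0)
best = None
for _ in range(20):                              # 20 random starts
    theta0 = rng.uniform(0, np.pi, size=2)
    res = minimize(qaoa_energy, theta0, method="COBYLA")
    if best is None or res.fun < best.fun:
        best = res                               # keep the best local optimum

print("best angles:", best.x, "energy:", best.fun)
```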

Overcoming Limitations of GPGPU-Computing in Scientific Applications
Connor Kenyon, Glenn Volkema, G. Khanna
Pub Date: 2019-05-10 | DOI: 10.1109/HPEC.2019.8916330
The performance of discrete general-purpose graphics processing units (GPGPUs) has been improving at a rapid pace. The PCIe interconnect that carries data between the system host memory and the GPU has not improved as quickly, leaving a performance gap due to GPU idle time while waiting for PCIe data transfers. In this article, we explore two alternatives to the limited PCIe bandwidth: the NVIDIA NVLink interconnect, and zero-copy algorithms for shared-memory Heterogeneous System Architecture (HSA) devices. The OpenCL SHOC benchmark suite is used to measure the performance of each device on various scientific application kernels.
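As a small companion to the benchmark discussion, the sketch below measures host-to-device bandwidth with CUDA events in PyCUDA, the transfer the article identifies as the bottleneck; the buffer size is arbitrary, and the paper's SHOC-based methodology is broader.

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

N = 256 * 1024 * 1024                            # 256 MiB payload
host = drv.pagelocked_empty(N, dtype=np.uint8)   # pinned host buffer
dev = drv.mem_alloc(N)

start, stop = drv.Event(), drv.Event()
start.record()
drv.memcpy_htod(dev, host)                       # the transfer under test
stop.record()
stop.synchronize()

ms = start.time_till(stop)                       # elapsed time in milliseconds
print(f"H2D bandwidth: {N / (ms * 1e-3) / 1e9:.2f} GB/s")
```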