
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID): Latest Publications

Main-Memory Requirements of Big Data Applications on Commodity Server Platform
Hosein Mohammadi Makrani, S. Rafatirad, A. Houmansadr, H. Homayoun
The emergence of big data frameworks requires computational and memory resources that can naturally scale to manage massive amounts of diverse data. It is currently unclear whether big data frameworks such as Hadoop, Spark, and MPI will require high-bandwidth, large-capacity memory to cope with this change. The primary purpose of this study is to answer this question through an empirical analysis of the memory configurations available for commodity servers, assessing the impact of these configurations on the performance of the Hadoop and Spark frameworks and of MPI-based applications. Our results show that neither DRAM capacity, frequency, nor the number of channels plays a critical role in the performance of any of the studied Hadoop applications or of most of the studied Spark applications. However, our results reveal that iterative tasks (e.g., machine learning) in Spark and MPI do benefit from high-bandwidth, large-capacity memory.
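Whether a framework profits from better memory comes down to whether its workloads actually saturate memory bandwidth. As a hedged illustration only (the paper's methodology runs real Hadoop, Spark, and MPI workloads across DIMM capacity, frequency, and channel configurations), a STREAM-style triad in Python can probe a node's effective bandwidth:

```python
# A minimal STREAM-style triad bandwidth probe -- an illustrative
# stand-in, not the paper's measurement setup.
import time
import numpy as np

def triad_bandwidth_gbs(n=20_000_000, reps=5):
    """Effective bandwidth in GB/s for a[i] = b[i] + s * c[i]."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    a = np.empty_like(b)
    s = 3.0
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        np.add(b, s * c, out=a)
        best = min(best, time.perf_counter() - t0)
    # Approximate traffic: read b, read c, write a. (The s * c temporary
    # adds extra traffic that this estimate ignores.)
    bytes_moved = 3 * n * a.itemsize
    return bytes_moved / best / 1e9

if __name__ == "__main__":
    print(f"triad bandwidth ~ {triad_bandwidth_gbs():.1f} GB/s")
```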
Citations: 11
Efficient Fault Tolerance Through Dynamic Node Replacement
Suraj Prabhakaran, M. Neumann, F. Wolf
The mean time between failures of upcoming exascale systems is expected to be one hour or less. To successfully complete application execution in such scenarios, several improved checkpoint/restart mechanisms, such as multi-level checkpointing, are being developed. Today, resource management systems handle job interruptions due to node failures by restarting the affected job from a checkpoint on a fresh set of nodes. This method, however, adds non-negligible overhead and will not allow taking full advantage of multi-level checkpointing in future systems. Alternatively, some spare nodes can be allocated to each job so that only the processes that die on the failed nodes need to be restarted on spare nodes. However, given the magnitude of the expected failure rates, the number of spare nodes to be allocated to each job would be high, causing significant resource wastage. This work proposes a dynamic way of handling node failures by enabling on-the-fly replacement of failed nodes with healthy ones. We propose a dynamic node replacement algorithm that finds replacement nodes by exploiting the flexibility of moldable and malleable jobs. Our evaluation with a simulator shows that this approach can maintain high throughput even when a system experiences frequent node failures, making it an ideal technique to complement multi-level checkpointing.
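A rough sketch of the replacement policy, using invented names (Job, malleable, min_nodes): prefer an idle node, otherwise shrink a malleable job by one node. The paper's actual algorithm negotiates such reconfigurations with moldable and malleable jobs through the resource management system:

```python
# Toy node-replacement policy -- illustrative only; all names assumed.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    nodes: set = field(default_factory=set)
    malleable: bool = False          # can shrink/grow at runtime
    min_nodes: int = 1

def find_replacement(idle_nodes, jobs):
    """Return a healthy node to take over for a failed one, or None."""
    if idle_nodes:                                   # cheapest: use a free node
        return idle_nodes.pop()
    for job in jobs:                                 # else shrink a malleable job
        if job.malleable and len(job.nodes) > job.min_nodes:
            return job.nodes.pop()                   # job releases one node
    return None                                      # no replacement available

def handle_failure(failed_node, job, idle_nodes, jobs):
    job.nodes.discard(failed_node)
    repl = find_replacement(idle_nodes, jobs)
    if repl is not None:
        job.nodes.add(repl)          # only processes on failed_node restart here
    return repl
```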
Citations: 9
Optimizing Preconditioned Conjugate Gradient on TaihuLight for OpenFOAM
James Lin, Minhua Wen, Delong Meng, Xin Liu, Akira Nukada, S. Matsuoka
Porting the domain-specific software OpenFOAM onto the TaihuLight supercomputer is a challenging task, due to the highly memory-bound nature of both the supercomputer's processor (SW26010) and the software's linear solvers. Our study tackles this technical challenge in three steps by optimizing the linear solvers, such as Preconditioned Conjugate Gradient (PCG), on the SW26010. First, to minimize the all_reduce communication cost of PCG, we developed a new algorithm, RNPCG, a non-blocking PCG that leverages the on-chip register communication. Second, we optimized three key kernels of the PCG, including a localized version of the Diagonal-based Incomplete Cholesky (LDIC) preconditioner. Third, to scale the RNPCG on TaihuLight, we designed three-level non-blocking all_reduce operations. With these three steps, we implemented the RNPCG in OpenFOAM. The experimental results on TaihuLight show that 1) compared with the default implementations in OpenFOAM, the RNPCG and the LDIC on a single core group of the SW26010 achieve maximum speedups of 8.9X and 3.1X, respectively; and 2) the scalable RNPCG outperforms the standard PCG in both strong and weak scaling up to 66,560 cores.
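For context, the sketch below is serial textbook PCG with a diagonal (Jacobi) preconditioner in NumPy; the dot-product reductions performed each iteration are the operations that become non-blocking all_reduce calls in a distributed solver such as RNPCG. This is an illustrative baseline, not the paper's SW26010 implementation (which also uses the stronger DIC/LDIC preconditioning):

```python
# Serial preconditioned conjugate gradient; each @-reduction below is a
# global all_reduce in the distributed setting.
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    M_inv = 1.0 / np.diag(A)        # diagonal (Jacobi) preconditioner
    z = M_inv * r
    p = z.copy()
    rz = r @ z                      # reduction
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)       # reduction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol: # reduction
            break
        z = M_inv * r
        rz_new = r @ z              # reduction
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Quick check on a well-conditioned SPD system.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)       # symmetric positive definite
b = rng.standard_normal(50)
x = pcg(A, b)
print(np.linalg.norm(A @ x - b))    # ~1e-8
```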
Citations: 3
Intelligently-Automated Facilities Expansion with the HEPCloud Decision Engine
P. Mhashilkar, Mine Altunay, W. Dagenhart, S. Fuess, B. Holzman, J. Kowalkowski, D. Litvintsev, Qiming Lu, A. Moibenko, M. Paterno, P. Spentzouris, S. Timm, A. Tiradani
The next generation of High Energy Physics experiments is expected to generate exabytes of data, two orders of magnitude more than the current generation. To reliably meet peak demands, facilities must either plan to provision enough resources to cover the forecasted need or find ways to elastically expand their computational capabilities. Commercial cloud and allocation-based High Performance Computing (HPC) resources both carry explicit and implicit costs that must be considered when deciding when to provision these resources and at what scale. To support such provisioning in a manner consistent with organizational business rules and budget constraints, we have developed a modular intelligent decision support system (IDSS) to aid in the automatic provisioning of resources spanning multiple cloud providers, multiple HPC centers, and grid computing federations.
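As a hypothetical sketch only (all names below are invented; the actual Decision Engine applies pluggable organizational business rules), a cost-aware provisioning choice might greedily cover forecast demand with the cheapest resources that fit an hourly budget:

```python
# Invented cost model: explicit price plus an implicit-cost proxy.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    free_slots: int
    price_per_hour: float    # explicit cost; allocation-based HPC may be 0
    startup_penalty: float   # implicit-cost proxy (provisioning latency, etc.)

def provision_plan(demand_slots, resources, hourly_budget):
    """Greedily cover demand with the cheapest resources within budget."""
    plan, spent = {}, 0.0
    ranked = sorted(resources, key=lambda r: r.price_per_hour + r.startup_penalty)
    for r in ranked:
        if demand_slots <= 0:
            break
        take = min(r.free_slots, demand_slots)
        if r.price_per_hour > 0:                       # respect the budget
            affordable = int((hourly_budget - spent) / r.price_per_hour)
            take = min(take, max(affordable, 0))
        if take > 0:
            plan[r.name] = take
            demand_slots -= take
            spent += take * r.price_per_hour
    return plan, spent
```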
Citations: 2
SAIDS: A Self-Adaptable Intrusion Detection System for IaaS Clouds
Anna Giannakou, Louis Rilling, C. Morin, Jean-Louis Pazat
IaaS clouds allow customers (called tenants) to deploy their IT as virtualized infrastructures. However, IaaS cloud features such as multi-tenancy and elasticity create new security vulnerabilities, for which security monitoring must be partly run by the cloud provider in order to give visibility at the level of the virtualization infrastructure. Unfortunately, the same features make the virtualized infrastructures frequently reconfigurable and thus impair the ability of a provider-run security monitoring system to detect attacks.
Citations: 1
GPU-Accelerated Algorithms for Allocating Virtual Infrastructure in Cloud Data Centers
Lucas Leandro Nesi, M. A. Pillon, M. Assunção, G. Koslovski
Allocating IT resources to Virtual Infrastructures (VIs) (i.e., groups of VMs, virtual switches, and their network interconnections) is an NP-hard problem. Most allocation algorithms designed to run on CPUs face scalability issues on current cloud data centers comprising thousands of servers. This work presents and evaluates a set of allocation algorithms refactored for Graphics Processing Units (GPUs). Experimental results demonstrate their ability to handle three large-scale data center topologies.
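For reference, a serial first-fit baseline for placing a VI's VMs onto servers looks like the sketch below (illustrative names; the paper's algorithms instead evaluate many candidate hosts in parallel on the GPU):

```python
# Serial first-fit VI placement -- a baseline sketch, not the paper's method.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    cpu_free: int
    mem_free: int

def place_vi(vm_demands, servers):
    """vm_demands: list of (cpu, mem) per VM. Returns {vm_index: server} or None."""
    placement = {}
    for i, (cpu, mem) in enumerate(vm_demands):
        host = next((s for s in servers
                     if s.cpu_free >= cpu and s.mem_free >= mem), None)
        if host is None:
            return None              # reject the VI request: no feasible host
        host.cpu_free -= cpu         # note: a production version would roll
        host.mem_free -= mem         # back these deductions on rejection
        placement[i] = host.name
    return placement
```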
Citations: 4
Accelerating Vertex Cover Optimization on a GPU Architecture
F. Abu-Khzam, DoKyung Kim, Matthew Perry, Kai Wang, Peter Shaw
Graphics Processing Units (GPUs) are gaining notable popularity due to their affordable high-performance multi-core architecture. They are particularly useful for massive computations that involve large data sets. In this paper, we present a highly scalable approach to the NP-hard Vertex Cover problem. Our method is based on an advanced data structure that reduces memory usage for more parallelism, and we propose a load-balancing scheme that is effective for multi-GPU architectures. Our parallel algorithm was implemented on multiple AMD GPUs using OpenCL. Experimental results show that our proposed approach can achieve significant speedups on the hard instances of the DIMACS benchmarks as well as on the notoriously hard 120-Cell graph and its variants.
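The underlying search is the classic bounded search tree for vertex cover: pick any uncovered edge (u, v); at least one endpoint must be in the cover, so branch on both choices. The serial Python sketch below conveys only that branching structure; the paper's contribution is running such a search efficiently across GPUs:

```python
# Classic O(2^k) branching for the parameterized vertex cover problem.
def vertex_cover(edges, k):
    """Return a vertex cover of size <= k for the given edges, or None."""
    if not edges:
        return set()
    if k == 0:
        return None                  # edges remain but budget is exhausted
    u, v = edges[0]                  # this edge must be covered by u or v
    for pick in (u, v):
        rest = [e for e in edges if pick not in e]   # edges pick covers vanish
        sub = vertex_cover(rest, k - 1)
        if sub is not None:
            return sub | {pick}
    return None

# A triangle needs two vertices; one is not enough.
assert vertex_cover([(1, 2), (2, 3), (1, 3)], 2) is not None
assert vertex_cover([(1, 2), (2, 3), (1, 3)], 1) is None
```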
Citations: 5
Exposing Hidden Performance Opportunities in High Performance GPU Applications
Benjamin Welton, B. Miller
Leadership-class systems with nodes containing many-core accelerators, such as GPUs, have the potential to increase the performance of applications. Effectively exploiting the parallelism provided by many-core accelerators requires developers to identify where accelerator parallelization would provide benefit and to ensure efficient interaction between the CPU and the accelerator. In the abstract, these issues appear straightforward and well understood. However, we have found that significant untapped performance opportunities exist in these areas even in well-known, heavily optimized, real-world applications created by experienced GPU developers. These untapped performance opportunities exist because accelerated libraries can create unexpected synchronization delays and memory transfer requests, interactions between accelerated libraries can cause unexpected inefficiencies when they are combined, and vectorization opportunities can be hidden by the structure of the program. In the applications we studied (Qball, QBox, HOOMD-blue, LAMMPS, and cuIBM), exploiting these opportunities reduced their execution time by 18%-87%. In this work, we provide concrete evidence of the existence and impact of these performance issues in real-world applications today. We characterize the missed performance opportunities we identified by their underlying cause and describe a preliminary design of detection methods that performance tools can use to identify these missed opportunities.
Citations: 9
Process Affinity, Metrics and Impact on Performance: An Empirical Study
Cyril Bordage, E. Jeannot
Process placement, also called topology mapping, is a well-known strategy to improve parallel program execution by reducing the communication cost between processes. It requires two inputs: the topology of the target machine and a measure of the affinity between processes. In the literature, the dominant affinity measure is the communication matrix that describes the amount of communication between processes. The goal of this paper is to study the accuracy of the communication matrix as a measure of affinity. We have done an extensive set of tests with two fat-tree machines and a 3d-torus machine to evaluate several hypotheses that are often made in the literature and to discuss their validity. First, we check the correlation between algorithmic metrics and the performance of the application. Then, we check whether a good generic process placement algorithm never degrades performance. And finally, we see whether the structure of the communication matrix can be used to predict gain.
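As a toy illustration of matrix-driven mapping (not one of the algorithms evaluated in the paper), the sketch below greedily places the heaviest-communicating process pairs on the closest free cores, assuming at least as many cores as processes:

```python
# Greedy topology mapping from a communication matrix -- illustrative only.
import numpy as np

def greedy_mapping(comm, dist):
    """comm[i,j]: traffic between processes i and j; dist[a,b]: core distance."""
    n = comm.shape[0]
    pairs = sorted(((comm[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    placement, free = {}, set(range(dist.shape[0]))
    for _, i, j in pairs:
        for p, q in ((i, j), (j, i)):
            if p in placement:
                continue
            if q in placement:       # put p on the free core closest to q's core
                core = min(free, key=lambda c: dist[placement[q], c])
            else:                    # neither placed yet: take any free core
                core = min(free)
            placement[p] = core
            free.discard(core)
    return placement

# Tiny example: 4 processes, 4 cores on two sockets (distance 1 within a
# socket, 2 across); processes 0-1 and 2-3 communicate heavily.
comm = np.array([[0, 9, 1, 0], [9, 0, 0, 1], [1, 0, 0, 9], [0, 1, 9, 0]])
dist = np.array([[0, 1, 2, 2], [1, 0, 2, 2], [2, 2, 0, 1], [2, 2, 1, 0]])
print(greedy_mapping(comm, dist))   # heavy pairs land on the same socket
```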
Citations: 6
RPPC: A Holistic Runtime System for Maximizing Performance Under Power Capping
Jinsu Park, Seongbeom Park, Woongki Baek
Maximizing performance in power-constrained computing environments is highly important in cloud and datacenter computing. To achieve the best possible performance of parallel applications under power capping, it is crucial to execute them with the optimal concurrency level and the optimal cross-component power allocation between CPUs and memory. Despite extensive prior work, efficient runtime support that maximizes the performance of parallel applications under power capping through coordinated control of the concurrency level and cross-component power allocation remains unexplored. To bridge this gap, this work proposes RPPC, a holistic runtime system for maximizing performance under power capping. In contrast to state-of-the-art techniques, RPPC robustly controls the two performance-critical knobs (i.e., the concurrency level and the cross-component power allocation) in a coordinated manner to maximize the performance of parallel applications under power capping. RPPC dynamically identifies the characteristics of the target parallel application and explores the system state space to find an efficient system state. Our experimental results demonstrate that RPPC significantly outperforms two state-of-the-art power-capping techniques, achieves performance comparable to the static best version, which requires extensive per-application offline profiling, incurs small performance overheads, and provides a re-adaptation mechanism for external events such as changes to the total power budget.
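A hedged sketch of what coordinated control of the two knobs can look like: hill-climb jointly over the concurrency level and the CPU share of the power cap, given an assumed measure_throughput(threads, cpu_watts) callback that runs the workload under that configuration. RPPC's actual search and re-adaptation logic is more sophisticated:

```python
# Coordinated hill-climbing over (concurrency level, CPU power share);
# measure_throughput is an assumed callback, not an RPPC API.
def tune(measure_throughput, max_threads, power_cap, step_w=10):
    best = (max_threads, power_cap // 2)             # (threads, CPU watts)
    best_perf = measure_throughput(*best)
    improved = True
    while improved:
        improved = False
        t, w = best
        candidates = (
            (max(1, t // 2), w),                      # halve concurrency
            (min(2 * t, max_threads), w),             # double concurrency
            (t, max(step_w, w - step_w)),             # shift power to memory
            (t, min(power_cap - step_w, w + step_w)), # shift power to CPU
        )
        for cand in candidates:
            if cand == best:
                continue
            perf = measure_throughput(*cand)  # memory gets power_cap - cand[1]
            if perf > best_perf:
                best, best_perf, improved = cand, perf, True
    return best
```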
Citations: 4