
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis — Latest Publications

Practical Symbolic Race Checking of GPU Programs
Peng Li, Guodong Li, G. Gopalakrishnan
Even the careful GPU programmer can inadvertently introduce data races while writing and optimizing code. Currently available GPU race checking methods fall short either in terms of their formal guarantees, ease of use, or practicality. Existing symbolic methods: (1) do not fully support existing CUDA kernels, (2) may require user-specified assertions or invariants, (3) often require users to guess which inputs may be safely made concrete, (4) tend to explode in complexity when the number of threads is increased, and (5) explode in the face of thread-ID based decisions, especially in a loop. We present SESA, a new tool combining Symbolic Execution and Static Analysis to analyze C++ CUDA programs that overcomes all these limitations. SESA also scales well to handle non-trivial benchmarks such as Parboil and Lonestar, and is the only tool of its class that handles such practical examples. This paper presents SESA's methodological innovations and practical results.
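To make the hazard concrete, here is a toy illustration (not SESA itself) of the core question a symbolic race checker asks: can two distinct thread IDs write to the same array index? Index expressions are modeled as plain Python functions; names are invented for illustration.

```python
# Toy write-write race check: enumerate thread IDs and look for two distinct
# threads whose index expressions collide on the same array slot.

def has_write_write_race(index_expr, num_threads):
    """Return True if two distinct thread IDs map to the same index."""
    seen = {}
    for tid in range(num_threads):
        idx = index_expr(tid)
        if idx in seen:
            return True  # threads seen[idx] and tid both write this slot
        seen[idx] = tid
    return False

# a[tid] = ... : each thread writes its own slot -> race-free
print(has_write_write_race(lambda tid: tid, 64))        # False
# a[tid % 32] = ... : threads 0 and 32 write the same slot -> race
print(has_write_write_race(lambda tid: tid % 32, 64))   # True
```

A real tool like SESA reasons symbolically over all thread IDs at once rather than enumerating them, which is what keeps it from exploding as the thread count grows.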
{"title":"Practical Symbolic Race Checking of GPU Programs","authors":"Peng Li, Guodong Li, G. Gopalakrishnan","doi":"10.1109/SC.2014.20","DOIUrl":"https://doi.org/10.1109/SC.2014.20","url":null,"abstract":"Even the careful GPU programmer can inadvertently introduce data races while writing and optimizing code. Currently available GPU race checking methods fall short either in terms of their formal guarantees, ease of use, or practicality. Existing symbolic methods: (1) do not fully support existing CUDA kernels, (2) may require user-specified assertions or invariants, (3) often require users to guess which inputs may be safely made concrete, (4) tend to explode in complexity when the number of threads is increased, and (5) explode in the face of thread-ID based decisions, especially in a loop. We present SESA, a new tool combining Symbolic Execution and Static Analysis to analyze C++ CUDA programs that overcomes all these limitations. SESA also scales well to handle non-trivial benchmarks such as Parboil and Lonestar, and is the only tool of its class that handles such practical examples. This paper presents SESA's methodological innovations and practical results.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128190666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy
E. Totoni, J. Torrellas, L. Kalé
The cache hierarchy often consumes a large portion of a processor's energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application's future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application's pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.
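The SPMD trick in the abstract can be sketched as follows: because every rank runs the same code, each rank can try one candidate cache configuration on the same region concurrently, and the lowest-energy winner is adopted by all. The configurations and energy numbers below are invented for illustration; this is a serial sketch of the idea, not the paper's runtime.

```python
# SPMD-style parallel configuration search, sketched serially: each "rank"
# measures one candidate cache configuration; the minimum-energy one wins.

candidate_configs = ["L1-full", "L1-half", "L1-quarter", "stream-buffer"]
measured_energy = {"L1-full": 10.0, "L1-half": 6.5,
                   "L1-quarter": 7.2, "stream-buffer": 5.8}

def best_config(configs, energy):
    # In the SPMD model every rank measures one config concurrently;
    # here we simply scan the measured table.
    return min(configs, key=lambda c: energy[c])

print(best_config(candidate_configs, measured_energy))  # stream-buffer
```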
{"title":"Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy","authors":"E. Totoni, J. Torrellas, L. Kalé","doi":"10.1109/SC.2014.90","DOIUrl":"https://doi.org/10.1109/SC.2014.90","url":null,"abstract":"The cache hierarchy often consumes a large portion of a processor's energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application's future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application's pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133177539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection
Chen-Yong Cher, M. Gupta, P. Bose, K. Muller
Soft Error Resiliency is a major concern for Petascale high performance computing (HPC) systems. Blue Gene/Q (BG/Q) is the third generation of IBM's massively parallel, energy efficient Blue Gene series of supercomputers. The principal goal of this work is to understand the interaction between Blue Gene/Q's hardware resiliency features and high-performance applications through proton irradiation of a real chip, and the software resiliency inherent in these applications through application-level fault injection (AFI) experiments. From the proton irradiation experiments, we derived the mean time between correctable errors at sea level for the SRAM-based register files and Level-1 caches of a system similar in scale to the Sequoia system. From the AFI experiments, we characterized relative vulnerability among the applications in both general purpose and floating point register files. We categorized and quantified the failure outcomes, and discovered characteristics in the applications that lead to many masking improvement opportunities.
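The fault-injection and masking notions above can be sketched in a few lines. This is a minimal model, not the paper's AFI tool: a single bit of a 64-bit register value is flipped, and the fault counts as "masked" when the computation's output is unchanged.

```python
# Minimal application-level fault injection (AFI) sketch: flip one bit of a
# 64-bit register value, then check whether the fault is masked by the
# computation that consumes it. Function names are illustrative.

def inject_bit_flip(value, bit):
    """Flip one bit of a 64-bit register value."""
    return (value ^ (1 << bit)) & 0xFFFFFFFFFFFFFFFF

def is_masked(original, faulty, compute):
    """A fault is 'masked' if the computation's output is unchanged."""
    return compute(original) == compute(faulty)

x = 0b1010
y = inject_bit_flip(x, 0)                  # 0b1011
print(is_masked(x, y, lambda v: v & ~1))   # True: low bit is discarded anyway
print(is_masked(x, y, lambda v: v))        # False: fault propagates
```

Running many such injections over an application's registers and classifying the outcomes (masked, wrong output, crash) is the essence of an AFI campaign.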
{"title":"Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection","authors":"Chen-Yong Cher, M. Gupta, P. Bose, K. Muller","doi":"10.1109/SC.2014.53","DOIUrl":"https://doi.org/10.1109/SC.2014.53","url":null,"abstract":"Soft Error Resiliency is a major concern for Petascale high performance computing (HPC) systems. Blue Gene/Q (BG/Q) is the third generation of IBM's massively parallel, energy efficient Blue Gene series of supercomputers. The principal goal of this work is to understand the interaction between Blue-Gene/Q's hardware resiliency features and high-performance applications through proton irradiation of a real chip, and software resiliency inherent in these applications through application-level fault injection (AFI) experiments. From the proton irradiation experiments we derived that the mean time between correctable errors at sea level of the SRAM-based register files and Level-1 caches for a system similar to the scale of Sequoia system. From the AFI experiments, we characterized relative vulnerability among the applications in both general purpose and floating point register files. We categorized and quantified the failure outcomes, and discovered characteristics in the applications that lead to many masking improvement opportunities.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131270981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
A System Software Approach to Proactive Memory-Error Avoidance
Carlos H. A. Costa, Yoonho Park, Bryan S. Rosenburg, Chen-Yong Cher, K. D. Ryu
Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and off-lines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.
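The prediction step can be sketched as a threshold policy over the correctable-error (CE) log: a page that keeps reporting correctable errors is assumed to be degrading, so its data is migrated and the page is taken offline before an uncorrectable error strikes. The threshold and log format here are invented for illustration, not the paper's model.

```python
# Hedged sketch of proactive page offlining: flag pages whose correctable
# error count reaches a threshold, so the OS can migrate them preemptively.

CE_THRESHOLD = 3  # illustrative policy parameter

def pages_to_offline(ce_log, threshold=CE_THRESHOLD):
    """ce_log: list of page numbers, one entry per correctable error."""
    counts = {}
    for page in ce_log:
        counts[page] = counts.get(page, 0) + 1
    return sorted(p for p, n in counts.items() if n >= threshold)

# Page 7 reported 3 correctable errors -> predicted likely to fail.
log = [7, 12, 7, 3, 7, 12]
print(pages_to_offline(log))  # [7]
```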
{"title":"A System Software Approach to Proactive Memory-Error Avoidance","authors":"Carlos H. A. Costa, Yoonho Park, Bryan S. Rosenburg, Chen-Yong Cher, K. D. Ryu","doi":"10.1109/SC.2014.63","DOIUrl":"https://doi.org/10.1109/SC.2014.63","url":null,"abstract":"Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and off lines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131481877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems
S. Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, M. Ezell, Ross G. Miller, Douglas Fuller, Raghul Gunasekaran, Youngjae Kim, Saurabh Gupta, Devesh Tiwari, Sudharshan S. Vazhkudai, James H. Rogers, D. Dillow, G. Shipman, Arthur S. Bland
The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures and strategies. This paper provides an account of our experience and lessons learned in acquiring, deploying, and operating large-scale parallel file systems. We believe that these lessons will be useful to the wider HPC community.
{"title":"Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems","authors":"S. Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, M. Ezell, Ross G. Miller, Douglas Fuller, Raghul Gunasekaran, Youngjae Kim, Saurabh Gupta, Devesh Tiwari, Sudharshan S. Vazhkudai, James H. Rogers, D. Dillow, G. Shipman, Arthur S. Bland","doi":"10.1109/SC.2014.23","DOIUrl":"https://doi.org/10.1109/SC.2014.23","url":null,"abstract":"The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures, and strategies. This paper provides an account of our experience and lessons learned in acquiring, deploying, and operating large-scale parallel file systems. We believe that these lessons will be useful to the wider HPC community.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124005081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41
Scalable Kernel Fusion for Memory-Bound GPU Applications
M. Wahib, N. Maruyama
GPU implementations of HPC applications relying on finite difference methods can include tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing data traffic to off-chip memory: kernels that share data arrays are fused into larger kernels, where on-chip cache is used to hold the data reused by instructions originating from different kernels. The main challenges are a) searching for the optimal kernel fusions while constrained by data dependencies and kernels' precedences and b) effectively applying kernel fusion to achieve speedup. This paper introduces a problem definition and proposes a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problems. The paper also proposes a codeless performance upper-bound projection model to achieve effective fusions. Results show that using the proposed scalable method for kernel fusion improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
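A loop-fusion analogue makes the memory-traffic argument concrete: two passes over the same array are fused into one, so the intermediate never round-trips through "off-chip" memory. This is an illustration only; real kernel fusion operates on CUDA kernels, not Python loops.

```python
# Unfused: kernel 1 writes tmp to memory, kernel 2 re-reads it.
def unfused(a):
    tmp = [x * 2.0 for x in a]         # pass 1: materializes intermediate
    return [t + 1.0 for t in tmp]      # pass 2: re-reads intermediate

# Fused: one pass, the intermediate stays in "registers" (a local value).
def fused(a):
    return [x * 2.0 + 1.0 for x in a]

data = [1.0, 2.0, 3.0]
print(unfused(data) == fused(data))    # True: same result, half the traffic
```

The hard part the paper addresses is deciding *which* kernels to fuse when there are tens of them with data dependencies, which is a search problem over fusion candidates.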
{"title":"Scalable Kernel Fusion for Memory-Bound GPU Applications","authors":"M. Wahib, N. Maruyama","doi":"10.1109/SC.2014.21","DOIUrl":"https://doi.org/10.1109/SC.2014.21","url":null,"abstract":"GPU implementations of HPC applications relying on finite difference methods can include tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing data traffic to off-chip memory, kernels that share data arrays are fused to larger kernels where on-chip cache is used to hold the data reused by instructions originating from different kernels. The main challenges are a) searching for the optimal kernel fusions while constrained by data dependencies and kernels' precedences and b) effectively applying kernel fusion to achieve speedup. This paper introduces a problem definition and proposes a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problems. The paper also proposes a codeless performance upper-bound projection model to achieve effective fusions. Results show that using the proposed scalable method for kernel fusion improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125083128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 73
A Communication-Optimal Framework for Contracting Distributed Tensors
Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, S. Krishnamoorthy, P. Sadayappan
Tensor contractions are extremely compute-intensive generalized matrix multiplication operations encountered in many computational science fields, such as quantum chemistry and nuclear physics. Unlike distributed matrix multiplication, which has been extensively studied, limited work has been done in understanding distributed tensor contractions. In this paper, we characterize distributed tensor contraction algorithms on torus networks. We develop a framework with three fundamental communication operators to generate communication-efficient contraction algorithms for arbitrary tensor contractions. We show that for a given amount of memory per processor, the framework is communication optimal for all tensor contractions. We demonstrate performance and scalability of the framework on up to 262,144 cores on a Blue Gene/Q supercomputer.
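For readers unfamiliar with the term, a contraction sums away a shared index, generalizing matrix multiplication. The pure-Python sketch below contracts the index k of A[i,j,k] against B[k,l] to produce C[i,j,l]; a real framework distributes these loops across a torus network, which is what the paper's communication analysis is about.

```python
# C[i][j][l] = sum_k A[i][j][k] * B[k][l] — a rank-3-by-rank-2 contraction.
def contract(A, B):
    I, J, K = len(A), len(A[0]), len(A[0][0])
    L = len(B[0])
    C = [[[0.0] * L for _ in range(J)] for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for l in range(L):
                C[i][j][l] = sum(A[i][j][k] * B[k][l] for k in range(K))
    return C

A = [[[1.0, 2.0]]]      # shape (1, 1, 2)
B = [[3.0], [4.0]]      # shape (2, 1)
print(contract(A, B))   # [[[11.0]]]  (1*3 + 2*4)
```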
{"title":"A Communication-Optimal Framework for Contracting Distributed Tensors","authors":"Samyam Rajbhandari, Akshay Nikam, Pai-Wei Lai, Kevin Stock, S. Krishnamoorthy, P. Sadayappan","doi":"10.1109/SC.2014.36","DOIUrl":"https://doi.org/10.1109/SC.2014.36","url":null,"abstract":"Tensor contractions are extremely compute intensive generalized matrix multiplication operations encountered in many computational science fields, such as quantum chemistry and nuclear physics. Unlike distributed matrix multiplication, which has been extensively studied, limited work has been done in understanding distributed tensor contractions. In this paper, we characterize distributed tensor contraction algorithms on torus networks. We develop a framework with three fundamental communication operators to generate communication-efficient contraction algorithms for arbitrary tensor contractions. We show that for a given amount of memory per processor, the framework is communication optimal for all tensor contractions. We demonstrate performance and scalability of the framework on up to 262,144 cores on a Blue Gene/Q supercomputer.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115663201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
Dissecting On-Node Memory Access Performance: A Semantic Approach
Alfredo Giménez, T. Gamblin, B. Rountree, A. Bhatele, Ilir Jusufi, P. Bremer, B. Hamann
Optimizing memory access is critical for performance and power efficiency. CPU manufacturers have developed sampling-based performance measurement units (PMUs) that report precise costs of memory accesses at specific addresses. However, this data is too low-level to be meaningfully interpreted and contains an excessive amount of irrelevant or uninteresting information. We have developed a method to gather fine-grained memory access performance data for specific data objects and regions of code with low overhead and attribute semantic information to the sampled memory accesses. This information provides the context necessary to more effectively interpret the data. We have developed a tool that performs this sampling and attribution and used the tool to discover and diagnose performance problems in real-world applications. Our techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.
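The attribution step can be sketched as an interval lookup: a sampled address is mapped to the semantic data object whose [start, start+size) range contains it. Object names and address ranges below are invented for illustration; a real tool would obtain them from the allocator or debug information.

```python
import bisect

# Invented object table: (start address, size in bytes, name), sorted by start.
objects = sorted([
    (0x1000, 0x400, "grid_A"),
    (0x2000, 0x800, "grid_B"),
])
starts = [o[0] for o in objects]

def attribute(addr):
    """Map a sampled address to the data object whose range contains it."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = objects[i]
        if addr < start + size:
            return name
    return "unknown"

print(attribute(0x1010))  # grid_A
print(attribute(0x1900))  # unknown (falls between the two objects)
print(attribute(0x2400))  # grid_B
```

Aggregating PMU samples by object name (instead of raw address) is what turns the low-level data the abstract criticizes into something a programmer can act on.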
{"title":"Dissecting On-Node Memory Access Performance: A Semantic Approach","authors":"Alfredo Giménez, T. Gamblin, B. Rountree, A. Bhatele, Ilir Jusufi, P. Bremer, B. Hamann","doi":"10.1109/SC.2014.19","DOIUrl":"https://doi.org/10.1109/SC.2014.19","url":null,"abstract":"Optimizing memory access is critical for performance and power efficiency. CPU manufacturers have developed sampling-based performance measurement units (PMUs) that report precise costs of memory accesses at specific addresses. However, this data is too low-level to be meaningfully interpreted and contains an excessive amount of irrelevant or uninteresting information. We have developed a method to gather fine-grained memory access performance data for specific data objects and regions of code with low overhead and attribute semantic information to the sampled memory accesses. This information provides the context necessary to more effectively interpret the data. We have developed a tool that performs this sampling and attribution and used the tool to discover and diagnose performance problems in real-world applications. 
Our techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127978456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
A Volume Integral Equation Stokes Solver for Problems with Variable Coefficients
D. Malhotra, A. Gholami, G. Biros
We present a novel numerical scheme for solving the Stokes equation with variable coefficients in the unit box. Our scheme is based on a volume integral equation formulation. Compared to finite element methods, our formulation decouples the velocity and pressure, generates velocity fields that are by construction divergence free to high accuracy and its performance does not depend on the order of the basis used for discretization. In addition, we employ a novel adaptive fast multipole method for volume integrals to obtain a scheme that is algorithmically optimal. Our scheme supports non-uniform discretizations and is spectrally accurate. To increase per node performance, we have integrated our code with both NVIDIA and Intel accelerators. In our largest scalability test, we solved a problem with 20 billion unknowns, using a 14th-order approximation for the velocity, on 2048 nodes of the Stampede system at the Texas Advanced Computing Center. We achieved 0.656 peta FLOPS for the overall code (23% efficiency) and one peta FLOPS for the volume integrals (33% efficiency). As an application example, we simulate Stokes flow in a porous medium with highly complex pore structure using a penalty formulation to enforce the no slip condition.
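For reference, the variable-coefficient Stokes system the abstract refers to is typically written as follows, with velocity $u$, pressure $p$, spatially varying viscosity $\mu(x)$, and forcing $f$; the exact formulation used in the paper may differ:

```latex
-\nabla \cdot \bigl( \mu(x)\,(\nabla u + \nabla u^{T}) \bigr) + \nabla p = f,
\qquad
\nabla \cdot u = 0 .
```

The second equation is the incompressibility constraint; "divergence free by construction" in the abstract means the scheme's velocity fields satisfy it to high accuracy without it being enforced separately.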
{"title":"A Volume Integral Equation Stokes Solver for Problems with Variable Coefficients","authors":"D. Malhotra, A. Gholami, G. Biros","doi":"10.1109/SC.2014.13","DOIUrl":"https://doi.org/10.1109/SC.2014.13","url":null,"abstract":"We present a novel numerical scheme for solving the Stokes equation with variable coefficients in the unit box. Our scheme is based on a volume integral equation formulation. Compared to finite element methods, our formulation decouples the velocity and pressure, generates velocity fields that are by construction divergence free to high accuracy and its performance does not depend on the order of the basis used for discretization. In addition, we employ a novel adaptive fast multipole method for volume integrals to obtain a scheme that is algorithmically optimal. Our scheme supports non-uniform discretizations and is spectrally accurate. To increase per node performance, we have integrated our code with both NVIDIA and Intel accelerators. In our largest scalability test, we solved a problem with 20 billion unknowns, using a 14-order approximation for the velocity, on 2048 nodes of the Stampede system at the Texas Advanced Computing Center. We achieved 0.656 peta FLOPS for the overall code (23% efficiency) and one peta FLOPS for the volume integrals (33% efficiency). 
As an application example, we simulate Stokes ow in a porous medium with highly complex pore structure using a penalty formulation to enforce the no slip condition.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128001724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
FlexSlot: Moving Hadoop Into the Cloud with Flexible Slot Management
Yanfei Guo, J. Rao, Changjun Jiang, Xiaobo Zhou
Load imbalance is a major source of overhead in Hadoop, where the uneven distribution of input data among tasks can significantly delay job completion. Running Hadoop in a private cloud opens up opportunities for mitigating data skew with elastic resource allocation, where stragglers are expedited with more resources, yet introduces problems that often cancel out the performance gain: (1) performance interference from co-running jobs may create new stragglers, (2) there exists a semantic gap between Hadoop task management and resource pool-based virtual cluster management, preventing efficient resource usage. We present FlexSlot, a user-transparent task slot management scheme that automatically identifies map stragglers and resizes their slots accordingly to accelerate task execution. FlexSlot adaptively changes the number of slots on each virtual node to promote efficient usage of the resource pool. Experimental results with representative benchmarks show that FlexSlot effectively reduces job completion time by 46% and achieves better resource utilization.
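The payoff of slot resizing can be seen in a toy makespan model (not FlexSlot's actual mechanism): a task's finish time scales with its input size divided by the resources in its slot, so enlarging a straggler's slot shortens the slowest task and hence the job. All numbers below are illustrative.

```python
# Toy straggler model: makespan is the finish time of the slowest task.
def makespan(input_sizes, slot_weights):
    """Each task's finish time = input size / slot resources assigned to it."""
    return max(s / w for s, w in zip(input_sizes, slot_weights))

sizes = [10.0, 10.0, 40.0]          # skewed input: task 2 is a straggler
print(makespan(sizes, [1, 1, 1]))   # 40.0 with equal slots
print(makespan(sizes, [1, 1, 2]))   # 20.0 after doubling the straggler's slot
```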
Citations: 25