2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC): Latest Publications

A distributed dynamic load balancer for iterative applications
Harshitha Menon, L. Kalé
For many applications, computation load varies over time. Such applications require dynamic load balancing to improve performance. Centralized load balancing schemes, which perform the load balancing decisions at a central location, are not scalable. In contrast, fully distributed strategies are scalable but typically do not produce a balanced work distribution, as they tend to consider only local information. This paper describes a fully distributed algorithm for load balancing that uses partial information about the global state of the system to perform load balancing. This algorithm, referred to as GrapevineLB, consists of two stages: global information propagation using a lightweight algorithm inspired by epidemic algorithms [21], and work unit transfer using a randomized algorithm. We provide analysis of the algorithm along with detailed simulation and performance comparison with other load balancing strategies. We demonstrate the effectiveness of GrapevineLB for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of BlueGene/Q.
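A minimal Python sketch of the two stages described above: epidemic-style propagation of partial global state, then randomized work transfer. All names are invented and loads are held static; this is an illustration, not the authors' implementation.

```python
import random

def gossip_round(views, fanout=2):
    # Stage 1: epidemic-style propagation. Each node forwards its current
    # view of (node -> load) to `fanout` random peers; views merge on
    # receipt, so after O(log N) rounds most nodes know most loads.
    incoming = [{} for _ in views]
    for view in views:
        for peer in random.sample(range(len(views)), fanout):
            incoming[peer].update(view)
    for view, extra in zip(views, incoming):
        view.update(extra)

loads = [random.randint(1, 100) for _ in range(16)]
views = [{i: loads[i]} for i in range(16)]       # each node knows only itself
for _ in range(6):                               # ~log2(16) rounds plus slack
    gossip_round(views)

# Stage 2: randomized transfer. An overloaded node picks an underloaded
# target with probability proportional to the target's estimated spare
# capacity, using only the partial global view it has gossiped together.
me = max(range(16), key=lambda i: loads[i])
avg = sum(views[me].values()) / len(views[me])
spare = {j: avg - l for j, l in views[me].items() if l < avg}
target = random.choices(list(spare), weights=list(spare.values()))[0]
print(f"node {me} migrates a work unit to node {target}")
```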
{"title":"A distributed dynamic load balancer for iterative applications","authors":"Harshitha Menon, L. Kalé","doi":"10.1145/2503210.2503284","DOIUrl":"https://doi.org/10.1145/2503210.2503284","url":null,"abstract":"For many applications, computation load varies over time. Such applications require dynamic load balancing to improve performance. Centralized load balancing schemes, which perform the load balancing decisions at a central location, are not scalable. In contrast, fully distributed strategies are scalable but typically do not produce a balanced work distribution as they tend to consider only local information. This paper describes a fully distributed algorithm for load balancing that uses partial information about the global state of the system to perform load balancing. This algorithm, referred to as GrapevineLB, consists of two stages: global information propagation using a lightweight algorithm inspired by epidemic [21] algorithms, and work unit transfer using a randomized algorithm. We provide analysis of the algorithm along with detailed simulation and performance comparison with other load balancing strategies. We demonstrate the effectiveness of GrapevineLB for adaptive mesh refinement and molecular dynamics on up to 131,072 cores of BlueGene/Q.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115137705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 55
A scalable, efficient scheme for evaluation of stencil computations over unstructured meshes
James King, R. Kirby
Stencil computations are a common class of operations that appear in many computational science and engineering applications. They often benefit from compile-time analysis, exploiting data locality and parallelism. Post-processing of discontinuous Galerkin (dG) simulation solutions with B-spline kernels is an example of a numerical method that requires evaluating computationally intensive stencil operations over a mesh. Previous work on stencil computations has focused on structured meshes, while giving little attention to unstructured meshes. Performing stencil operations over an unstructured mesh requires sampling of heterogeneous elements, which often leads to inefficient memory access patterns and limits data locality and reuse. In this paper, we present an efficient method for performing stencil computations over unstructured meshes that increases data locality and cache efficiency, and a scalable approach for stencil tiling and concurrent execution. We provide experimental results in the context of post-processing of dG solutions that demonstrate the effectiveness of our approach.
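To make the memory-access problem concrete, here is a hedged Python sketch of a stencil sweep over a CSR-style unstructured adjacency. The variable-length neighbor gathers are where locality is lost; reordering elements so that neighbors get nearby indices (for instance along a space-filling curve) is what restores it. The toy mesh and all names are invented.

```python
import numpy as np

def apply_stencil(values, offsets, neighbors, weights):
    # Unstructured analogue of a stencil sweep: for each element, gather
    # its variable-length neighbor list from a CSR-style adjacency and
    # combine with per-edge weights. The indirect load values[neighbors]
    # is the cache-hostile step that element reordering mitigates.
    out = np.empty_like(values)
    for e in range(len(values)):
        beg, end = offsets[e], offsets[e + 1]
        out[e] = weights[beg:end] @ values[neighbors[beg:end]]
    return out

# 4 elements in a ring, each averaging itself and its two neighbors
values    = np.array([1.0, 2.0, 3.0, 4.0])
offsets   = np.array([0, 3, 6, 9, 12])
neighbors = np.array([0, 1, 3,  1, 0, 2,  2, 1, 3,  3, 2, 0])
weights   = np.full(12, 1 / 3)
print(apply_stencil(values, offsets, neighbors, weights))
```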
{"title":"A scalable, efficient scheme for evaluation of stencil computations over unstructured meshes","authors":"James King, R. Kirby","doi":"10.1145/2503210.2503214","DOIUrl":"https://doi.org/10.1145/2503210.2503214","url":null,"abstract":"Stencil computations are a common class of operations that appear in many computational scientific and engineering applications. Stencil computations often benefit from compiletime analysis, exploiting data-locality, and parallelism. Post-processing of discontinuous Galerkin (dG) simulation solutions with B-spline kernels is an example of a numerical method which requires evaluating computationally intensive stencil operations over a mesh. Previous work on stencil computations has focused on structured meshes, while giving little attention to unstructured meshes. Performing stencil operations over an unstructured mesh requires sampling of heterogeneous elements which often leads to inefficient memory access patterns and limits data locality/reuse. In this paper, we present an efficient method for performing stencil computations over unstructured meshes which increases data-locality and cache efficiency, and a scalable approach for stencil tiling and concurrent execution. We provide experimental results in the context of post-processing of dG solutions that demonstrate the effectiveness of our approach.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128117929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Low-power, low-storage-overhead chipkill correct via multi-line error correction
Xun Jian, Henry Duwe, J. Sartori, Vilas Sridharan, Rakesh Kumar
Due to their large memory capacities, many modern servers require chipkill correct, an advanced type of memory error detection and correction, to meet their reliability requirements. However, existing chipkill-correct solutions incur high power or storage overheads, or both, because they use dedicated error-correction resources per codeword to perform error correction. This requires high overhead for correction and results in high overhead for error detection as well. We propose a novel chipkill-correct solution, multi-line error correction, that uses resources shared across multiple lines in memory for error correction, reducing the overhead of both error detection and correction. Our evaluations show that the proposed solution reduces memory power by a mean of 27%, and by up to 38%, with respect to commercial solutions, at a cost of a 0.4% increase in storage overhead and minimal impact on reliability.
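A loose erasure-coding analogy in Python, not the paper's actual DRAM ECC layout: detection stays cheap and per-line (a checksum each), while the correction resource (one XOR parity line) is shared across the whole group instead of being dedicated per codeword.

```python
import zlib
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def protect(lines):
    # per-line detection (CRC), one shared correction line per group
    return [zlib.crc32(l) for l in lines], reduce(xor, lines)

def repair(lines, checks, parity):
    # locate lines whose checksum mismatches; if exactly one line in
    # the group is bad, rebuild it from the parity and the others
    bad = [i for i, l in enumerate(lines) if zlib.crc32(l) != checks[i]]
    if len(bad) == 1:
        others = [l for i, l in enumerate(lines) if i != bad[0]]
        lines[bad[0]] = reduce(xor, others, parity)
    return lines

group = [b"line-a..", b"line-b..", b"line-c..", b"line-d.."]
checks, parity = protect(group)
group[2] = b"corrupt!"                     # inject a detectable error
assert repair(group, checks, parity)[2] == b"line-c.."
```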
{"title":"Low-power, low-storage-overhead chipkill correct via multi-line error correction","authors":"Xun Jian, Henry Duwe, J. Sartori, Vilas Sridharan, Rakesh Kumar","doi":"10.1145/2503210.2503243","DOIUrl":"https://doi.org/10.1145/2503210.2503243","url":null,"abstract":"Due to their large memory capacities, many modern servers require chipkill correct, an advanced type of memory error detection and correction, to meet their reliability requirements. However, existing chipkill-correct solutions incur high power or storage overheads, or both because they use dedicated error-correction resources per codeword to perform error correction. This requires high overhead for correction and results in high overhead for error detection. We propose a novel chipkill-correct solution, multi-line error correction, that uses resources shared across multiple lines in memory for error correction to reduce the overhead of both error detection and correction. Our evaluations show that the proposed solution reduces memory power by a mean of 27%, and up to 38% with respect to commercial solutions, at a cost of 0.4% increase in storage overhead and minimal impact on reliability.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133197257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
Coordinated energy management in heterogeneous processors
Indrani Paul, Vignesh T. Ravi, Srilatha Manne, Manish Arora, S. Yalamanchili
This paper examines energy management in a heterogeneous processor consisting of an integrated CPU-GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need to coordinate energy management across distinct core types, a new and less understood problem. We examine the intra-node CPU-GPU frequency sensitivity of HPC applications on tightly coupled CPU-GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU-GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves the measured average energy-delay-squared (ED^2) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
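A hypothetical sketch of the core policy: given per-frequency-pair predictions derived from measured CPU/GPU frequency sensitivity, pick the state that minimizes ED^2 = energy * delay^2 = P * t^3, subject to a performance-loss bound. The dictionary layout and all numbers are invented stand-ins, not DynaCo's actual interface.

```python
def pick_state(predictions, baseline_time, max_loss=0.02):
    # predictions: {(cpu_ghz, gpu_ghz): (predicted_time_s, power_w)}
    best, best_ed2 = None, float("inf")
    for state, (t, p) in predictions.items():
        if t > baseline_time * (1 + max_loss):
            continue                      # violates the performance bound
        ed2 = p * t ** 3                  # energy * delay^2
        if ed2 < best_ed2:
            best, best_ed2 = state, ed2
    return best

states = {(2.4, 1.0): (10.0, 95.0),       # fastest, highest power
          (1.8, 1.0): (10.1, 78.0),       # CPU-insensitive phase: downclock
          (2.4, 0.6): (12.5, 70.0)}       # too slow for the 2% budget
print(pick_state(states, baseline_time=10.0))   # -> (1.8, 1.0)
```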
{"title":"Coordinated energy management in heterogeneous processors","authors":"Indrani Paul, Vignesh T. Ravi, Srilatha Manne, Manish Arora, S. Yalamanchili","doi":"10.1145/2503210.2503227","DOIUrl":"https://doi.org/10.1145/2503210.2503227","url":null,"abstract":"This paper examines energy management in a heterogeneous processor consisting of an integrated CPU-GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types - a new and less understood problem. We examine the intra-node CPU-GPU frequency sensitivity of HPC applications on tightly coupled CPU-GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU-GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED^2) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133426553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
SIDR: Structure-aware intelligent data routing in hadoop
Joe B. Buck, Noah Watkins, Greg Levin, A. Crume, Kleoni Ioannidou, S. Brandt, C. Maltzahn, N. Polyzotis, Aaron Torres
The MapReduce framework is being extended to domains quite different from the web applications for which it was designed, including the processing of big structured data, e.g., scientific and financial data. Previous work using MapReduce to process scientific data ignores existing structure when assigning intermediate data and scheduling tasks. In this paper, we present a method for incorporating knowledge of the structure of scientific data, and of the executing query, into the MapReduce communication model. Built in SciHadoop, a version of the Hadoop MapReduce framework for scientific data, SIDR intelligently partitions and routes intermediate data, allowing it to: remove Hadoop's global barrier and execute Reduce tasks prior to all Map tasks completing; minimize intermediate key skew; and produce early, correct results. SIDR executes queries up to 2.5 times faster than Hadoop and 37% faster than SciHadoop; produces initial results with only 6% of the query completed; and produces dense, contiguous output.
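A toy Python illustration of the routing idea, with an invented coordinate key space and an invented reducer-readiness test; SIDR's actual partitioning logic is richer than this. The point is that deterministic, structure-aware routing makes a reducer's input set predictable, so it can start before the global barrier.

```python
def partition(key, n_reducers, bounds):
    # Structure-aware routing: intermediate keys are coordinates in a
    # known scientific-data domain, so each reducer owns one contiguous
    # slab of that domain instead of an opaque hash bucket.
    lo, hi = bounds
    width = (hi - lo) / n_reducers
    return min(int((key - lo) / width), n_reducers - 1)

def ready(reducer, done_maps, overlap):
    # Because routing is deterministic in the key space, reducer r's
    # input is complete once every map task whose input block overlaps
    # slab r has finished: no global barrier, early correct results.
    return overlap[reducer] <= done_maps

# reducer r -> the map tasks whose input blocks overlap slab r
overlap = {0: {0}, 1: {0, 1}, 2: {1, 2}, 3: {2, 3}}
print(partition(key=0.30, n_reducers=4, bounds=(0.0, 1.0)))  # -> slab 1
print(ready(1, done_maps={0, 1}, overlap=overlap))           # True: start now
```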
{"title":"SIDR: Structure-aware intelligent data routing in hadoop","authors":"Joe B. Buck, Noah Watkins, Greg Levin, A. Crume, Kleoni Ioannidou, S. Brandt, C. Maltzahn, N. Polyzotis, Aaron Torres","doi":"10.1145/2503210.2503241","DOIUrl":"https://doi.org/10.1145/2503210.2503241","url":null,"abstract":"The MapReduce framework is being extended for domains quite different from the web applications for which it was designed, including the processing of big structured data, e.g., scientific and financial data. Previous work using MapReduce to process scientific data ignores existing structure when assigning intermediate data and scheduling tasks. In this paper, we present a method for incorporating knowledge of the structure of scientific data and executing query into the MapReduce communication model. Built in SciHadoop, a version of the Hadoop MapReduce framework for scientific data, SIDR intelligently partitions and routes intermediate data, allowing it to: remove Hadoop's global barrier and execute Reduce tasks prior to all Map tasks completing; minimize intermediate key skew; and produce early, correct results. SIDR executes queries up to 2.5 times faster than Hadoop and 37% faster than SciHadoop; produces initial results with only 6% of the query completed; and produces dense, contiguous output.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133069711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems
Xu Yang, Zhou Zhou, Sean Wallace, Z. Lan, Wei Tang, S. Coghlan, M. Papka
The research literature to date has mainly aimed at reducing energy consumption in HPC environments. In this paper we propose a job power aware scheduling mechanism to reduce HPC's electricity bill without degrading system utilization. The novelty of our job scheduling mechanism is its ability to take the variation of electricity price into consideration as a means to make better decisions about when to schedule jobs with diverse power profiles. We verified the effectiveness of our design by conducting trace-based experiments on an IBM Blue Gene/P and a cluster system, as well as a case study on Argonne's 48-rack IBM Blue Gene/Q system. Our preliminary results show that our power aware algorithm can reduce the electricity bill of HPC systems by as much as 23%.
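As a toy of the pricing idea (invented numbers, one job per price window, job lengths and node constraints ignored): steer the most power-hungry jobs into the cheapest upcoming windows.

```python
def schedule(jobs, windows):
    # Greedy sketch: pair jobs sorted by descending power draw with
    # price windows sorted by ascending price, so the biggest consumers
    # land where electricity is cheapest.
    jobs = sorted(jobs, key=lambda j: -j[1])        # high power first
    windows = sorted(windows, key=lambda w: w[1])   # cheapest first
    return {name: start for (name, _), (start, _) in zip(jobs, windows)}

jobs = [("md", 900.0), ("cfd", 400.0), ("qcd", 650.0)]   # name, kW draw
windows = [(0, 0.04), (8, 0.11), (16, 0.07)]             # start hour, $/kWh
print(schedule(jobs, windows))   # {'md': 0, 'qcd': 16, 'cfd': 8}
```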
{"title":"Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems","authors":"Xu Yang, Zhou Zhou, Sean Wallace, Z. Lan, Wei Tang, S. Coghlan, M. Papka","doi":"10.1145/2503210.2503264","DOIUrl":"https://doi.org/10.1145/2503210.2503264","url":null,"abstract":"The research literature to date mainly aimed at reducing energy consumption in HPC environments. In this paper we propose a job power aware scheduling mechanism to reduce HPC's electricity bill without degrading the system utilization. The novelty of our job scheduling mechanism is its ability to take the variation of electricity price into consideration as a means to make better decisions of the timing of scheduling jobs with diverse power profiles. We verified the effectiveness of our design by conducting trace-based experiments on an IBM Blue Gene/P and a cluster system as well as a case study on Argonne's 48-rack IBM Blue Gene/Q system. Our preliminary results show that our power aware algorithm can reduce electricity bill of HPC systems as much as 23%.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"229 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115605538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 86
Algorithms for high-throughput disk-to-disk sorting
H. Sundar, D. Malhotra, K. Schulz
In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance including the cost of IO and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical sortBenchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the IO latency. This latency hiding lets a large (5TB) sort problem run with our out-of-core approach, using one tenth of the RAM, in a time comparable to a single in-RAM sort, even including the additional temporary IO required. In our largest run, sorting 100TB of records using 1792 hosts, we achieved an end-to-end throughput of 1.24TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.
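A compact Python sketch of the general out-of-core pattern (sorted in-RAM runs spilled to disk, then a k-way merge), under the assumption that this is a reasonable stand-in; the asynchronous IO overlap the paper relies on is only noted in a comment, and all helper names are invented.

```python
import heapq
import os
import pickle
import random
import tempfile

def spill(chunk):
    # write one sorted run to temporary storage
    f = tempfile.NamedTemporaryFile(delete=False)
    pickle.dump(chunk, f)
    f.close()
    return f.name

def drain(path):
    # stream a run back; delete it once fully consumed
    with open(path, "rb") as f:
        yield from pickle.load(f)
    os.remove(path)

def external_sort(records, run_size):
    # Phase 1: sort run_size-record chunks in RAM, spill sorted runs.
    # (The paper additionally hides this IO behind the sorting via
    # asynchronous transfers; a prefetch/writeback thread per run
    # would model that.)  Phase 2: k-way merge of the sorted runs.
    runs, chunk = [], []
    for r in records:
        chunk.append(r)
        if len(chunk) == run_size:
            runs.append(spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(spill(sorted(chunk)))
    yield from heapq.merge(*(drain(p) for p in runs))

data = [random.random() for _ in range(10_000)]
assert list(external_sort(iter(data), run_size=1_000)) == sorted(data)
```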
{"title":"Algorithms for high-throughput disk-to-disk sorting","authors":"H. Sundar, D. Malhotra, K. Schulz","doi":"10.1145/2503210.2503259","DOIUrl":"https://doi.org/10.1145/2503210.2503259","url":null,"abstract":"In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance including the cost of IO and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical sortBenchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the IO latency. This latency hiding enables us to achieve comparable execution times, including the additional temporary IO required, between a large sort problem (5TB) run as a single, in-RAM sort and our out-of-core approach using 1/10th the amount of RAM. In our largest run, sorting 100TB of records using 1792 hosts, we achieved an end-to-end throughput of 1.24TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124754462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Using automated performance modeling to find scalability bugs in complex codes
A. Calotoiu, T. Hoefler, Marius Poke, F. Wolf
Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it for at most a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both the coverage and the speed of this scalability analysis can be substantially improved. By generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
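A simplified sketch of the empirical-modeling step: fit candidates of the form c0 + c1 * p^i * log2(p)^j by least squares over measured runtimes at a few core counts p, and keep the best fit. The exponent sets and the fit criterion here are invented stand-ins for the paper's search space, and the data is synthetic.

```python
import itertools
import numpy as np

def fit_model(p, t):
    # try every term in a small search space and keep the best fit;
    # a superlinear surviving term flags a scalability bug
    logp = np.log2(p)
    best = None
    for i, j in itertools.product([0, 0.5, 1, 1.5, 2], [0, 1, 2]):
        X = np.column_stack([np.ones_like(p), p ** i * logp ** j])
        coef, *_ = np.linalg.lstsq(X, t, rcond=None)
        sse = float(np.sum((X @ coef - t) ** 2))
        if best is None or sse < best[0]:
            best = (sse, i, j, coef)
    _, i, j, (c0, c1) = best
    return f"t(p) ~ {c0:.3g} + {c1:.3g} * p^{i} * log2(p)^{j}"

p = np.array([64.0, 128, 256, 512, 1024])   # measured core counts
t = 3.0 + 0.01 * p * np.log2(p)             # synthetic timings
print(fit_model(p, t))                      # recovers the p*log2(p) term
```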
{"title":"Using automated performance modeling to find scalability bugs in complex codes","authors":"A. Calotoiu, T. Hoefler, Marius Poke, F. Wolf","doi":"10.1145/2503210.2503277","DOIUrl":"https://doi.org/10.1145/2503210.2503277","url":null,"abstract":"Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made-a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132266329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 137
GoldRush: Resource efficient in situ scientific data analytics using fine-grained interference aware execution
F. Zheng, Hongfeng Yu, Can Hantas, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, S. Klasky
Severe I/O bottlenecks on high-end computing platforms call for running data analytics in situ. Observing that considerable compute-node resources go unused by typical high-end scientific simulations, we leverage this fact by creating an agile runtime, termed GoldRush, that can harvest those otherwise wasted, idle resources to efficiently run in situ data analytics. GoldRush uses fine-grained scheduling to “steal” idle resources, in ways that minimize interference between the simulation and the in situ analytics. This involves recognizing the potential causes of on-node resource contention and then using scheduling methods that prevent them. Experiments with representative science applications at large scales show that resources harvested on compute nodes can be leveraged to perform useful analytics, significantly improving resource efficiency, reducing the data movement costs incurred by alternate solutions, and posing negligible impact on the scientific simulations.
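A coarse threading toy of the harvesting idea (invented timings, no real preemption, not the GoldRush runtime): the solver raises a flag when it enters a blocking phase, and fine-grained analytics units run only while the flag is set, so the stolen cycles come from waits rather than compute.

```python
import threading
import time

idle = threading.Event()

def analytics(chunks):
    # each fine-grained work unit only starts while the solver has
    # signalled idleness; small units keep interference bounded
    for c in chunks:
        idle.wait()
        time.sleep(0.001)        # stand-in for processing chunk c

def simulation_step():
    time.sleep(0.010)            # solver compute: analytics stays parked
    idle.set()                   # entering a blocking phase (e.g. MPI wait)
    time.sleep(0.005)            # communication; analytics runs meanwhile
    idle.clear()                 # back to compute

t = threading.Thread(target=analytics, args=(range(100),), daemon=True)
t.start()
for _ in range(10):
    simulation_step()
```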
{"title":"GoldRush: Resource efficient in situ scientific data analytics using fine-grained interference aware execution","authors":"F. Zheng, Hongfeng Yu, Can Hantas, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, S. Klasky","doi":"10.1145/2503210.2503279","DOIUrl":"https://doi.org/10.1145/2503210.2503279","url":null,"abstract":"Severe I/O bottlenecks on High End Computing platforms call for running data analytics in situ. Demonstrating that there exist considerable resources in compute nodes un-used by typical high end scientific simulations, we leverage this fact by creating an agile runtime, termed GoldRush, that can harvest those otherwise wasted, idle resources to efficiently run in situ data analytics. GoldRush uses fine-grained scheduling to “steal” idle resources, in ways that minimize interference between the simulation and in situ analytics. This involves recognizing the potential causes of on-node resource contention and then using scheduling methods that prevent them. Experiments with representative science applications at large scales show that resources harvested on compute nodes can be leveraged to perform useful analytics, significantly improving resource efficiency, reducing data movement costs incurred by alternate solutions, and posing negligible impact on scientific simulations.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122218397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 82
Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors
Jongsoo Park, Ganesh Bikshandi, K. Vaidyanathan, P. T. P. Tang, P. Dubey, Daehyun Kim
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance modeling, and well-executed optimizations, we break the teraflop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× what is achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movement in node-local computations by exploiting caches. Our coordination of a low-communication algorithm and a massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
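The Segment-of-Interest algorithm itself is not spelled out in the abstract. As background for why communication dominates distributed 1D FFTs, here is the classic four-step factorization in numpy, an N-point transform decomposed into two batches of smaller FFTs; in a distributed run, the switch from column FFTs to row FFTs is where the all-to-all transpose happens, which is exactly the cost that low-communication variants attack.

```python
import numpy as np

def fft_four_step(x, n1, n2):
    # Four-step factorization of an N-point FFT (N = n1*n2):
    # column FFTs, twiddle multiply, row FFTs, output reordering.
    a = x.reshape(n1, n2)
    b = np.fft.fft(a, axis=0)                        # n2 FFTs of size n1
    k1 = np.arange(n1)[:, None]
    t2 = np.arange(n2)[None, :]
    b *= np.exp(-2j * np.pi * k1 * t2 / (n1 * n2))   # twiddle factors
    c = np.fft.fft(b, axis=1)                        # n1 FFTs of size n2
    return c.T.ravel()                               # reorder the output

x = np.random.rand(1024) + 1j * np.random.rand(1024)
assert np.allclose(fft_four_step(x, 32, 32), np.fft.fft(x))
```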
{"title":"Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors","authors":"Jongsoo Park, Ganesh Bikshandi, K. Vaidyanathan, P. T. P. Tang, P. Dubey, Daehyun Kim","doi":"10.1145/2503210.2503242","DOIUrl":"https://doi.org/10.1145/2503210.2503242","url":null,"abstract":"This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× than achievable on a same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116624960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33