
2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC): Latest Publications

SDQuery DSI: Integrating data management support with a wide area data transfer protocol
Yunde Su, Yi Wang, G. Agrawal, R. Kettimuthu
In many science areas where datasets need to be transferred or shared, rapid growth in dataset size, coupled with much slower increases in wide area data transfer bandwidths, is making it extremely hard for scientists to analyze the data. This paper addresses the current limitations by developing SDQuery DSI, a GridFTP plug-in that supports flexible server-side data subsetting. An existing GridFTP server is able to dynamically load this tool to support new functionality. Different query types (queries over dimensions, coordinates, and values) are supported by our tool. A number of optimizations, such as parallel indexing, a performance model for data subsetting, and parallel streaming, are also applied. We compare our SDQuery DSI with the GridFTP default File DSI in different network environments, and show that our method achieves better efficiency in almost all cases.
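The three query types mentioned in the abstract can be illustrated with a toy sketch. This is plain-Python subsetting over a small 2D grid; the function names and data layout are invented for illustration and are not the actual SDQuery DSI interface:

```python
# Toy sketch of server-side data subsetting: three query types over a
# small 2D dataset (hypothetical API, not the SDQuery DSI interface).

def query_dims(data, rows, cols):
    """Subset by dimension index ranges, e.g. rows 0..2, cols 1..3."""
    return [row[cols[0]:cols[1]] for row in data[rows[0]:rows[1]]]

def query_coords(data, coords):
    """Subset by an explicit list of (row, col) coordinates."""
    return [data[r][c] for r, c in coords]

def query_values(data, lo, hi):
    """Subset by value range: keep elements with lo <= v <= hi."""
    return [v for row in data for v in row if lo <= v <= hi]

grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(query_dims(grid, (0, 2), (1, 3)))      # [[2, 3], [5, 6]]
print(query_coords(grid, [(0, 0), (2, 2)]))  # [1, 9]
print(query_values(grid, 4, 7))              # [4, 5, 6, 7]
```

The point of doing this server-side is that only the (usually much smaller) query result crosses the wide-area link instead of the whole file.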
DOI: 10.1145/2503210.2503270 · Published: 2013-11-17
Citations: 27
There goes the neighborhood: Performance degradation due to nearby jobs
A. Bhatele, K. Mohror, S. Langer, Katherine E. Isaacs
Predictable performance is important for understanding and alleviating application performance issues; quantifying the effects of source code, compiler, or system software changes; estimating the time required for batch jobs; and determining the allocation requests for proposals. Our experiments show that on a Cray XE system, the execution time of a communication-heavy parallel application ranges from 28% faster to 41% slower than the average observed performance. Blue Gene systems, on the other hand, demonstrate no noticeable run-to-run variability. In this paper, we focus on Cray machines and investigate potential causes for performance variability such as OS jitter, shape of the allocated partition, and interference from other jobs sharing the same network links. Reducing such variability could improve overall throughput at a computer center and save energy costs.
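The reported variability range can be made concrete by expressing each run's execution time relative to the average. The run times below are hypothetical, chosen so the extremes match the paper's 28%-faster / 41%-slower observation:

```python
# Illustrative run-to-run variability calculation (made-up run times):
# express each run's execution time relative to the average.

def relative_deviation(times):
    avg = sum(times) / len(times)
    return [(t - avg) / avg for t in times]

runs = [72.0, 100.0, 141.0, 87.0]  # hypothetical seconds per run
devs = relative_deviation(runs)
fastest = min(devs)  # negative => faster than average
slowest = max(devs)
print(f"{-fastest:.0%} faster to {slowest:.0%} slower than average")
```

On a machine with no run-to-run variability (the Blue Gene case in the paper), all deviations would be near zero.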
DOI: 10.1145/2503210.2503247 · Published: 2013-11-17
Citations: 166
The origin of mass
P. Boyle, M. Buchoff, N. Christ, T. Izubuchi, C. Jung, T. Luu, R. Mawhinney, C. Schroeder, R. Soltz, P. Vranas, J. Wasem
The origin of mass is one of the deepest mysteries in science. Neutrons and protons, which account for almost all visible mass in the Universe, emerged from a primordial plasma through a cataclysmic phase transition microseconds after the Big Bang. However, most mass in the Universe is invisible. The existence of dark matter, which interacts with our world so weakly that it is essentially undetectable, has been established from its galactic-scale gravitational effects. Here we describe results from the first truly physical calculations of the cosmic phase transition and a groundbreaking first-principles investigation into composite dark matter, studies impossible with previous state-of-the-art methods and resources. By inventing a powerful new algorithm, “DSDR,” and implementing it effectively for contemporary supercomputers, we attain excellent strong scaling, perfect weak scaling to the LLNL BlueGene/Q two million cores, sustained speed of 7.2 petaflops, and time-to-solution speedup of more than 200 over the previous state-of-the-art.
DOI: 10.1145/2503210.2504561 · Published: 2013-11-17
Citations: 20
Exploring portfolio scheduling for long-term execution of scientific workloads in IaaS clouds
Kefeng Deng, Junqiang Song, Kaijun Ren, A. Iosup
Long-term execution of scientific applications often leads to dynamic workloads and varying application requirements. When the execution uses resources provisioned from IaaS clouds, and thus incurs consumption-based payment, efficient online scheduling algorithms must be found. Portfolio scheduling, which dynamically selects a suitable policy from a broad portfolio, may provide a solution to this problem. However, selecting online the right policy from possibly tens of alternatives remains challenging. In this work, we introduce an abstract model to explore this selection problem. Based on the model, we present a comprehensive portfolio scheduler that includes tens of provisioning and allocation policies. We propose an algorithm that can increase the chance of selecting the best policy in limited time, possibly online. Through trace-based simulation, we evaluate various aspects of our portfolio scheduler, and find performance improvements from 7% to 100% in comparison with the best constituent policies, with high improvement for bursty workloads.
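The core selection step of portfolio scheduling can be sketched as: simulate each policy in the portfolio on the recent workload, then activate the one with the best predicted cost. The two policies and their cost functions below are invented for illustration, not the paper's actual portfolio:

```python
# Minimal portfolio-scheduling sketch (policies and costs invented):
# pick the policy with the lowest simulated cost on recent jobs.

def fcfs_cost(jobs):
    """First-come-first-served: sum of completion times in arrival order."""
    t, cost = 0, 0
    for j in jobs:
        t += j
        cost += t
    return cost

def sjf_cost(jobs):
    """Shortest-job-first: same accumulation, but shortest jobs first."""
    return fcfs_cost(sorted(jobs))

PORTFOLIO = {"FCFS": fcfs_cost, "SJF": sjf_cost}

def select_policy(recent_jobs):
    """Simulate every policy on the recent workload; return the cheapest."""
    return min(PORTFOLIO, key=lambda name: PORTFOLIO[name](recent_jobs))

print(select_policy([10, 1, 1, 1]))  # a bursty mix favors SJF here
```

The hard part the paper addresses is doing this selection online, in limited time, over tens of policies rather than two.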
DOI: 10.1145/2503210.2503244 · Published: 2013-11-17
Citations: 44
Scalable matrix computations on large scale-free graphs using 2D graph partitioning
E. Boman, K. Devine, S. Rajamanickam
Scalable parallel computing is essential for processing large scale-free (power-law) graphs. The distribution of data across processes becomes important on distributed-memory computers with thousands of cores. It has been shown that two-dimensional layouts (edge partitioning) can have significant advantages over traditional one-dimensional layouts. However, simple 2D block distribution does not use the structure of the graph, and more advanced 2D partitioning methods are too expensive for large graphs. We propose a new two-dimensional partitioning algorithm that combines graph partitioning with 2D block distribution. The computational cost of the algorithm is essentially the same as 1D graph partitioning. We study the performance of sparse matrix-vector multiplication (SpMV) for scale-free graphs from the web and social networks using several different partitioners and both 1D and 2D data layouts. We show that SpMV run time is reduced by exploiting the graph's structure. Contrary to popular belief, we observe that current graph and hypergraph partitioners often yield relatively good partitions on scale-free graphs. We demonstrate that our new 2D partitioning method consistently outperforms the other methods considered, for both SpMV and an eigensolver, on matrices with up to 1.6 billion nonzeros using up to 16,384 cores.
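A plain 2D block distribution (the baseline on which the paper layers graph partitioning) can be sketched as follows. The toy matrix is invented; note how the one high-degree row is split across process columns, which is the basic reason 2D layouts help on power-law graphs:

```python
# A plain 2D block distribution of sparse-matrix nonzeros over a
# pr x pc process grid (illustrative baseline only; the paper's method
# additionally applies graph partitioning before this step).
from collections import Counter

def owner_2d(i, j, n, pr, pc):
    """Process grid coordinate (row, col) owning nonzero (i, j)."""
    br = (n + pr - 1) // pr  # rows per process row (ceiling division)
    bc = (n + pc - 1) // pc  # cols per process column
    return (i // br, j // bc)

n, pr, pc = 8, 2, 2
# Nonzeros of a toy scale-free-ish matrix: one high-degree row (row 0).
nnz = [(0, j) for j in range(8)] + [(5, 5), (7, 2)]
load = Counter(owner_2d(i, j, n, pr, pc) for i, j in nnz)
print(dict(load))  # row 0's work is split between process columns 0 and 1
```

In a 1D row layout, all of row 0's nonzeros would land on a single process, creating exactly the load imbalance that 2D partitioning avoids.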
DOI: 10.1145/2503210.2503293 · Published: 2013-11-17
Citations: 106
SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing
Thomas Ropars, Tatiana V. Martsinkevich, Amina Guermouche, A. Schiper, F. Cappello
The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: We identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called the always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and this without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
DOI: 10.1145/2503210.2503271 · Published: 2013-11-17
Citations: 35
On the usefulness of object tracking techniques in performance analysis
Germán Llort, Harald Servat, Juan Gonzalez, Judit Giménez, Jesús Labarta
Understanding the behavior of a parallel application is crucial if we are to tune it to achieve its maximum performance. Yet the behavior the application exhibits may change over time and depend on the actual execution scenario: particular inputs and program settings, the number of processes used, or hardware-specific problems. So beyond the details of a single experiment a far more interesting question arises: how does the application behavior respond to changes in the execution conditions? In this paper, we demonstrate that object tracking concepts from computer vision have huge potential to be applied in the context of performance analysis. We leverage tracking techniques to analyze how the behavior of a parallel application evolves through multiple scenarios where the execution conditions change. This method provides comprehensible insights on the influence of different parameters on the application behavior, enabling us to identify the most relevant code regions and their performance trends.
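The underlying tracking step — linking behavior clusters across experiments so that a cluster's evolution can be followed as execution conditions change — can be sketched with nearest-centroid matching. The cluster data below is invented and the metrics are placeholders, not the paper's actual feature space:

```python
# Toy sketch of the tracking idea: match behavior clusters between two
# experiments by nearest centroid (invented data and metrics).
import math

def track(prev, curr):
    """Map each current cluster centroid to the nearest previous one."""
    links = {}
    for cid, centroid in curr.items():
        nearest = min(prev, key=lambda p: math.dist(prev[p], centroid))
        links[cid] = nearest
    return links

# Hypothetical cluster centroids: (IPC * 100, memory-bound ratio).
run_small = {"A": (10.0, 0.2), "B": (40.0, 0.8)}
run_large = {"X": (12.0, 0.25), "Y": (35.0, 0.7)}
print(track(run_small, run_large))  # {'X': 'A', 'Y': 'B'}
```

Once clusters are linked across scenarios, the drift of each centroid shows how the corresponding code region's performance responds to the changed conditions.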
DOI: 10.1145/2503210.2503267 · Published: 2013-11-17
Citations: 30
Enabling highly-scalable remote memory access programming with MPI-3 one sided
R. Gerstenberger, Maciej Besta, T. Hoefler
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicality have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
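A simple alpha-beta (latency-bandwidth) cost model illustrates the kind of performance model the abstract refers to for critical functions such as a one-sided put. The constants below are hypothetical placeholders, not measured values from the paper:

```python
# Alpha-beta cost model sketch for a one-sided put (constants are
# hypothetical: 2 us startup latency, 5 GB/s bandwidth).

def t_put(msg_bytes, alpha=2e-6, beta=1 / 5e9):
    """Predicted transfer time: startup latency + bytes / bandwidth."""
    return alpha + msg_bytes * beta

for size in (8, 8 * 1024, 8 * 1024 * 1024):
    print(f"{size:>10} B -> {t_put(size) * 1e6:9.1f} us")
```

Small messages are latency-dominated (the alpha term) and large messages bandwidth-dominated (the beta term), which is why such models are fitted per function and per message-size regime.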
DOI: 10.1145/2503210.2503286 · Published: 2013-11-17
Citations: 128
A data-centric profiler for parallel programs
Xu Liu, J. Mellor-Crummey
It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
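The attribution step — mapping sampled memory-access latencies back to the variable whose allocation contains the sampled address — can be sketched with an interval lookup. The addresses and variable ranges below are invented for illustration, not HPCToolkit's internal representation:

```python
# Toy sketch of data-centric attribution: map latency samples to the
# variable whose allocation range contains the sampled address
# (invented addresses and ranges).
import bisect

# Sorted (start, end, name) allocation ranges, as a profiler might record.
ranges = [(0x1000, 0x2000, "matrix_a"), (0x3000, 0x5000, "matrix_b")]
starts = [r[0] for r in ranges]

def attribute(addr):
    """Return the variable owning addr, or '<unknown>' if unmapped."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and addr < ranges[i][1]:
        return ranges[i][2]
    return "<unknown>"

samples = [(0x1100, 300), (0x3500, 40), (0x6000, 10)]  # (addr, latency cycles)
totals = {}
for addr, lat in samples:
    totals[attribute(addr)] = totals.get(attribute(addr), 0) + lat
print(totals)  # {'matrix_a': 300, 'matrix_b': 40, '<unknown>': 10}
```

Aggregating latency per variable, rather than per instruction alone, is what makes the locality problem (here, `matrix_a`) visible to the programmer.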
DOI: 10.1145/2503210.2503297 · Published: 2013-11-17
Citations: 58
Exploring power behaviors and trade-offs of in-situ data analytics
Marc Gamell, I. Rodero, M. Parashar, Janine Bennett, H. Kolla, Jacqueline H. Chen, P. Bremer, Aaditya G. Landge, A. Gyulassy, P. McCormick, S. Pakin, Valerio Pascucci, S. Klasky
As scientific applications target exascale, challenges related to data and energy are becoming dominating concerns. For example, coupled simulation workflows are increasingly adopting in-situ data processing and analysis techniques to address costs and overheads due to data movement and I/O. However, it is also critical to understand these overheads and associated trade-offs from an energy perspective. The goal of this paper is to explore data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. Specifically, this paper presents: (1) an analysis of the data-related behaviors of a combustion simulation workflow with an in-situ data analytics pipeline, running on the Titan system at ORNL; (2) a power model based on system power and data exchange patterns, which is empirically validated; and (3) the use of the model to characterize the energy behavior of the workflow and to explore energy/performance trade-offs on current as well as emerging systems.
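The flavor of trade-off such a power model exposes can be sketched with a minimal energy equation: node energy as static power times runtime, plus a per-byte cost for data moved. All coefficients and scenario numbers below are invented, not the paper's validated model:

```python
# Minimal energy-model sketch (invented coefficients): compare running
# analysis in-situ against moving the full output out for offline analysis.

def energy_j(runtime_s, p_static_w, bytes_moved, j_per_byte):
    """Energy = static node power * time + per-byte cost of data movement."""
    return runtime_s * p_static_w + bytes_moved * j_per_byte

# In-situ: longer runtime (analysis shares the node), little data moved.
insitu = energy_j(110.0, 200.0, 1e9, 1e-8)
# Offline: shorter simulation runtime, but the full output is transferred.
offline = energy_j(100.0, 200.0, 500e9, 1e-8)
print(f"in-situ {insitu:.0f} J vs offline {offline:.0f} J")
```

With these made-up numbers the in-situ variant wins, but the balance flips as the per-byte movement cost or the in-situ runtime penalty changes, which is exactly the trade-off space the paper's empirically validated model is built to explore.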
{"title":"Exploring power behaviors and trade-offs of in-situ data analytics","authors":"Marc Gamell, I. Rodero, M. Parashar, Janine Bennett, H. Kolla, Jacqueline H. Chen, P. Bremer, Aaditya G. Landge, A. Gyulassy, P. McCormick, S. Pakin, Valerio Pascucci, S. Klasky","doi":"10.1145/2503210.2503303","DOIUrl":"https://doi.org/10.1145/2503210.2503303","url":null,"abstract":"As scientific applications target exascale, challenges related to data and energy are becoming dominating concerns. For example, coupled simulation workflows are increasingly adopting in-situ data processing and analysis techniques to address costs and overheads due to data movement and I/O. However it is also critical to understand these overheads and associated trade-offs from an energy perspective. The goal of this paper is exploring data-related energy/performance trade-offs for end-to-end simulation workflows running at scale on current high-end computing systems. Specifically, this paper presents: (1) an analysis of the data-related behaviors of a combustion simulation workflow with an insitu data analytics pipeline, running on the Titan system at ORNL; (2) a power model based on system power and data exchange patterns, which is empirically validated; and (3) the use of the model to characterize the energy behavior of the workflow and to explore energy/performance tradeoffs on current as well as emerging systems.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 52
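The kind of power model the abstract mentions — combining system power draw with data-exchange patterns — can be illustrated with a toy additive estimate. The formula and every coefficient below are invented placeholders for illustration, not the paper's empirically validated model.

```python
# Toy energy model: energy for one simulation step = time spent at
# (baseline + active) node power, plus a per-byte cost for data moved.
# All wattages, durations, and byte counts are made-up placeholders.
def step_energy_j(duration_s, p_base_w, p_active_w, bytes_moved, j_per_byte):
    """Energy in joules: baseline draw + active compute + data-exchange term."""
    return duration_s * (p_base_w + p_active_w) + bytes_moved * j_per_byte

# In-situ analytics: longer on-node compute, but little data leaves the node.
e_insitu = step_energy_j(12.0, 100.0, 80.0, 64e6, 2e-8)
# Offline analysis: shorter compute, but the full snapshot is shipped off-node.
e_offline = step_energy_j(10.0, 100.0, 60.0, 4e10, 2e-8)
print(round(e_insitu), round(e_offline))  # → 2161 2400
```

Even this crude sketch exposes the trade-off the paper studies: whether the extra compute power of in-situ processing is outweighed by the energy cost of moving raw data for offline analysis.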