
Latest publications from SC15: International Conference for High Performance Computing, Networking, Storage and Analysis

Recovering logical structure from Charm++ event traces
Katherine E. Isaacs, A. Bhatele, J. Lifflander, David Böhme, T. Gamblin, M. Schulz, B. Hamann, P. Bremer
Asynchrony and non-determinism in Charm++ programs present a significant challenge in analyzing their event traces. We present a new framework to organize event traces of parallel programs written in Charm++. Our reorganization allows one to more easily explore and analyze such traces by providing context through logical structure. We describe several heuristics to compensate for missing dependencies between events that currently cannot be easily recorded. We introduce a new task ordering that recovers logical structure from the non-deterministic execution order. Using the logical structure, we define several metrics to help guide developers to performance problems. We demonstrate our approach through two proxy applications written in Charm++. Finally, we discuss the applicability of this framework to other task-based runtimes and provide guidelines for tracing to support this form of analysis.
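Below is a minimal Python sketch of the kind of logical ordering such a framework recovers: each event is assigned a logical step one greater than the latest event it depends on, so events group into phases independently of their non-deterministic physical timestamps. The event names and dependency map are illustrative assumptions, not the paper's actual data model.

```python
# Hypothetical sketch: level a trace's dependency DAG so that every event
# lands one logical step after the latest event it depends on, regardless
# of the (non-deterministic) physical arrival order.

def logical_steps(events, deps):
    """events: iterable of event ids; deps: event id -> list of predecessor ids."""
    memo = {}

    def step(e):
        if e not in memo:
            preds = deps.get(e, [])
            memo[e] = 0 if not preds else 1 + max(step(p) for p in preds)
        return memo[e]

    return {e: step(e) for e in events}

# Example: two chares exchanging messages; receives depend on matching sends.
deps = {"recv_b1": ["send_a1"], "send_b2": ["recv_b1"], "recv_a2": ["send_b2"]}
print(logical_steps(["send_a1", "recv_b1", "send_b2", "recv_a2"], deps))
# {'send_a1': 0, 'recv_b1': 1, 'send_b2': 2, 'recv_a2': 3}
```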
Citations: 11
Full correlation matrix analysis of fMRI data on Intel® Xeon Phi™ coprocessors
Yida Wang, Michael J. Anderson, J. Cohen, A. Heinecke, K. Li, N. Satish, N. Sundaram, N. Turk-Browne, Theodore L. Willke
Full correlation matrix analysis (FCMA) is an unbiased approach for exhaustively studying interactions among brain regions in functional magnetic resonance imaging (fMRI) data from human participants. In order to answer neuroscientific questions efficiently, we are developing a closed-loop analysis system with FCMA on a cluster of nodes with Intel® Xeon Phi™ coprocessors. Here we propose several ideas for data-driven algorithmic modification to improve the performance on the coprocessor. Our experiments with real datasets show that the optimized single-node code runs 5x-16x faster than the baseline implementation using the well-known Intel® MKL and LibSVM libraries, and that the cluster implementation achieves near linear speedup on 5760 cores.
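As a rough illustration of why FCMA maps so well onto MKL-style dense kernels, the full correlation matrix reduces to a single matrix multiply once each voxel time series is z-scored. This NumPy sketch assumes illustrative shapes; it is not the authors' optimized coprocessor code.

```python
# A minimal NumPy sketch of the core FCMA kernel: after normalizing each
# voxel's time series, all pairwise correlations come from one dense GEMM,
# which is why libraries such as Intel MKL dominate the runtime.
import numpy as np

def full_correlation_matrix(X):
    """X: (n_voxels, n_timepoints) fMRI data for one epoch."""
    Xn = X - X.mean(axis=1, keepdims=True)           # remove each voxel's mean
    Xn /= np.linalg.norm(Xn, axis=1, keepdims=True)  # unit-norm each time series
    return Xn @ Xn.T                                 # (n_voxels, n_voxels) correlations

X = np.random.rand(1000, 200)
C = full_correlation_matrix(X)
assert np.allclose(np.diag(C), 1.0)                  # self-correlation is 1
```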
Citations: 17
Finding the limits of power-constrained application performance
Peter E. Bailey, Aniruddha Marathe, D. Lowenthal, B. Rountree, M. Schulz
As we approach exascale systems, power is turning from an optimization goal to a critical operating constraint. With power bounds imposed by both stakeholders and the limitations of existing infrastructure, we need to develop new techniques that work with limited power to extract maximum performance. In this paper, we explore this area and provide an approach to find the theoretical upper bound of computational performance on a per-application basis in hybrid MPI + OpenMP applications. We use a linear programming (LP) formulation to optimize application schedules under various power constraints, where a schedule consists of a DVFS state and number of OpenMP threads for each section of computation between consecutive MPI calls. We also provide a more flexible mixed integer-linear (ILP) formulation and show that the resulting schedules closely match schedules from the LP formulation. Across four applications, we use our LP-derived upper bounds to show that current approaches trail optimal, power-constrained performance by up to 41.1%. This demonstrates the untapped potential of current systems, and our LP formulation provides future optimization approaches with a quantitative optimization target.
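The following toy LP, in the spirit of the paper's formulation, picks a (DVFS state, thread count) configuration per code section; the relaxation lets picks be fractional, as in the LP variant. The timing and power numbers, and the energy-budget proxy standing in for the power bound, are illustrative assumptions.

```python
# A toy LP relaxation: choose one configuration per section to minimize
# total time, subject to a linear energy budget. Numbers are made up.
import numpy as np
from scipy.optimize import linprog

t = np.array([[1.0, 1.6], [2.0, 3.1]])      # seconds: section s under config c
p = np.array([[90.0, 60.0], [95.0, 55.0]])  # watts drawn by each config
E_budget = 280.0                            # joules allowed across the run

c = t.ravel()                               # objective: total runtime
A_ub = [(p * t).ravel()]                    # total energy <= budget
b_ub = [E_budget]
A_eq = [[1, 1, 0, 0], [0, 0, 1, 1]]         # exactly one config per section
b_eq = [1, 1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(t.shape), res.fun)      # fractional schedule and its runtime
```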
Citations: 48
Data partitioning strategies for graph workloads on heterogeneous clusters
Michael LeBeane, Shuang Song, Reena Panda, Jee Ho Ryoo, L. John
Large-scale graph analytics is an important class of problems in the modern data center. However, while data centers are trending towards a large number of heterogeneous processing nodes, graph analytics frameworks still operate under the assumption of uniform compute resources. In this paper, we develop heterogeneity-aware data ingress strategies for graph analytics workloads using the popular PowerGraph framework. We illustrate how simple estimates of relative node computational throughput can guide heterogeneity-aware data partitioning algorithms to provide balanced graph cutting decisions. Our work enhances five online data ingress strategies from a variety of sources to optimize application execution for throughput differences in heterogeneous data centers. The proposed partitioning algorithms improve the runtime of several popular machine learning and data mining applications by as much as 65% and on average by 32% compared to the default, balanced partitioning approaches.
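A hedged sketch of the core idea: rather than balancing raw edge counts, weight each node's load by an estimate of its relative throughput so that faster nodes receive proportionally more of the graph. The greedy rule and throughput figures below are illustrative, not PowerGraph's actual ingress code.

```python
# Heterogeneity-aware greedy edge placement: each edge goes to the node
# that least increases throughput-normalized load.

def partition_edges(edges, throughput):
    """edges: list of (src, dst); throughput: node id -> relative speed."""
    load = {n: 0 for n in throughput}
    assignment = {}
    for e in edges:
        n = min(load, key=lambda n: (load[n] + 1) / throughput[n])
        load[n] += 1
        assignment[e] = n
    return assignment, load

edges = [(i, i + 1) for i in range(12)]
_, load = partition_edges(edges, {"cpu_node": 1.0, "gpu_node": 2.0})
print(load)  # a 1:2 split: {'cpu_node': 4, 'gpu_node': 8}
```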
Citations: 39
Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs
Théo Mary, I. Yamazaki, J. Kurzak, P. Luszczek, S. Tomov, J. Dongarra
A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper, we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8x faster than the deterministic approach for computing the approximation of the same accuracy. We also present the parallel scaling of the random sampling over multiple GPUs on a single compute node, showing a speedup of 3.8x over three Kepler GPUs. These results demonstrate the potential of the random sampling as an excellent computational tool for many applications, and its potential is likely to grow on the emerging computers with the increasing communication costs.
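For intuition, here is a minimal NumPy sketch of the random-sampling approach the paper evaluates, in the spirit of the randomized range finder of Halko et al.: project A onto a random subspace, orthonormalize, and compress. The rank and oversampling values are illustrative assumptions.

```python
# Randomized low-rank approximation: the heavy operations are GEMM and a
# tall-skinny QR, both communication-friendly kernels on GPUs.
import numpy as np

def random_sampling_lowrank(A, k, oversample=10):
    m, n = A.shape
    Omega = np.random.randn(n, k + oversample)  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)              # orthonormal basis for the sampled range
    B = Q.T @ A                                 # small (k + p) x n factor
    return Q, B                                 # A ~= Q @ B

A = np.random.rand(500, 40) @ np.random.rand(40, 500)  # rank-40 test matrix
Q, B = random_sampling_lowrank(A, k=50)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))   # tiny: rank 40 < k
```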
Citations: 13
Monetary cost optimizations for MPI-based HPC applications on Amazon clouds: checkpoints and replicated execution
Yifan Gong, Bingsheng He, Amelie Chi Zhou
In this paper, we propose monetary cost optimizations for MPI-based applications with deadline constraints on Amazon EC2. Particularly, we consider utilizing two kinds of Amazon EC2 instances (on-demand and spot instances). As a spot instance can fail at any time due to out-of-bid events, fault tolerant executions are necessary. Through detailed studies, we have found that two common fault tolerant mechanisms, i.e., checkpoints and replicated executions, are complementary for cost-effective MPI executions on spot instances. We formulate the optimization problem and propose a novel cost model to minimize the expected monetary cost. The experimental results with NPB benchmarks on Amazon EC2 demonstrate that 1) it is feasible to run MPI applications with performance constraints on spot instances, 2) our proposal achieves significant monetary cost reduction compared to the state-of-the-art algorithm and 3) it is necessary to adaptively choose checkpoint and replication techniques for cost-effective and reliable MPI executions on Amazon EC2.
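The sketch below is a toy expected-cost model in the same spirit: it compares cheaper spot instances carrying checkpoint overhead and expected rework against reliable on-demand instances. All prices, failure rates, and overheads are illustrative assumptions, not the paper's measured values or its actual cost model.

```python
# Toy expected monetary cost: checkpointing slows the run down, and each
# expected failure costs some recomputation, but the spot price discount
# can still win. All parameters are made-up illustrations.

def expected_cost(work_hours, price, fail_rate=0.0, ckpt_overhead=0.0, rework=0.0):
    """fail_rate: expected failures/hour; rework: hours lost per failure."""
    hours = work_hours * (1 + ckpt_overhead)   # checkpointing slowdown
    hours += fail_rate * hours * rework        # expected recomputation
    return hours * price

on_demand = expected_cost(work_hours=10, price=0.50)
spot = expected_cost(work_hours=10, price=0.15,
                     fail_rate=0.05, ckpt_overhead=0.08, rework=0.5)
print(f"on-demand ${on_demand:.2f} vs spot+checkpoints ${spot:.2f}")
```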
Citations: 28
Multi-objective job placement in clusters
S. Blagodurov, Alexandra Fedorova, Evgeny Vinnik, Tyler Dwyer, Fabien Hermenier
One of the key decisions made by both MapReduce and HPC cluster management frameworks is the placement of jobs within a cluster. To make this decision, they consider factors like resource constraints within a node or the proximity of data to a process. However, they fail to account for the degree of collocation on the cluster's nodes. A tight process placement can create contention for the intra-node shared resources, such as shared caches, memory, disk, or network bandwidth. A loose placement would create less contention, but exacerbate network delays and increase cluster-wide power consumption. Finding the best job placement is challenging, because among many possible placements, we need to find one that gives us an acceptable trade-off between performance and power consumption. We propose to tackle the problem via multi-objective optimization. Our solution is able to balance conflicting objectives specified by the user and efficiently find a suitable job placement.
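A small sketch of the multi-objective core: among candidate placements scored on (runtime, power), keep only the Pareto-optimal ones, from which a user-specified weighting can then pick. The candidate scores are illustrative assumptions.

```python
# Pareto filter over candidate placements: a placement survives unless some
# other placement is at least as good on both objectives and better on one.

def pareto_front(candidates):
    """candidates: list of (name, runtime, power); lower is better on both."""
    front = []
    for name, t, w in candidates:
        if not any(t2 <= t and w2 <= w and (t2, w2) != (t, w)
                   for _, t2, w2 in candidates):
            front.append((name, t, w))
    return front

placements = [("tight", 100, 900), ("loose", 140, 700),
              ("mixed", 115, 760), ("bad", 150, 950)]
print(pareto_front(placements))
# [('tight', 100, 900), ('loose', 140, 700), ('mixed', 115, 760)]
```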
Citations: 33
Clock delta compression for scalable order-replay of non-deterministic parallel applications
Kento Sato, D. Ahn, I. Laguna, Gregory L. Lee, M. Schulz
The ability to record and replay program execution helps significantly in debugging non-deterministic MPI applications by reproducing message-receive orders. However, the large amount of data that traditional record-and-replay techniques record precludes their practical applicability to massively parallel applications. In this paper, we propose a new compression algorithm, Clock Delta Compression (CDC), for scalable record and replay of non-deterministic MPI applications. CDC defines a reference order of message receives based on a totally ordered relation using Lamport clocks, and only records the differences between this reference logical-clock order and an observed order. Our evaluation shows that CDC significantly reduces the record data size. For example, when we apply CDC to the Monte Carlo particle transport Benchmark (MCB), which represents common non-deterministic communication patterns, CDC reduces the record size by approximately two orders of magnitude compared to traditional techniques and incurs between 13.1% and 25.5% of runtime overhead.
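The sketch below illustrates the CDC idea on toy data: derive a reference receive order from Lamport clocks, then record only the positions where the observed order deviates from it. The event tuples are illustrative; the paper's recorder operates inside the MPI layer.

```python
# Clock Delta Compression, toy version: the reference order is the total
# order by (lamport_clock, sender_rank); only deviations from it are logged.

def cdc_record(observed):
    """observed: receives as (lamport_clock, sender_rank) in arrival order."""
    reference = sorted(observed)
    return [(i, ev) for i, ev in enumerate(observed) if ev != reference[i]]

def cdc_replay(deltas, events):
    order = sorted(events)          # rebuild the reference order
    for i, ev in deltas:            # re-apply only the recorded deviations
        order[i] = ev
    return order

events = [(1, 0), (1, 2), (2, 1), (3, 0)]
observed = [(1, 0), (2, 1), (1, 2), (3, 0)]   # two receives arrived swapped
deltas = cdc_record(observed)
print(deltas)                                  # 2 deltas instead of 4 full records
assert cdc_replay(deltas, events) == observed
```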
Citations: 20
Practical scalable consensus for pseudo-synchronous distributed systems
T. Hérault, Aurélien Bouteiller, G. Bosilca, Marc Gamell, K. Teranishi, M. Parashar, J. Dongarra
The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault tolerant scientific applications.
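To see why tree-based agreement scales logarithmically, consider this toy simulation in which votes are combined pairwise up a binary tree, so n processes need only O(log n) combining rounds. This illustrates the scaling argument only; ERA's early return and fault handling are not modeled here.

```python
# Pairwise combining up a binary tree: each round halves the number of
# outstanding votes, giving ceil(log2(n)) rounds for n participants.
import math

def tree_agreement(votes):
    rounds = 0
    while len(votes) > 1:
        votes = [votes[i] and votes[i + 1] if i + 1 < len(votes) else votes[i]
                 for i in range(0, len(votes), 2)]   # combine "commit" votes
        rounds += 1
    return votes[0], rounds

decision, rounds = tree_agreement([True] * 1024)
print(decision, rounds, math.ceil(math.log2(1024)))  # True 10 10
```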
Citations: 22
GraphReduce: processing large-scale graphs on accelerator-based systems
D. Sengupta, S. Song, K. Agarwal, K. Schwan
Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device. GraphReduce-based programming is performed via device functions that include gatherMap, gatherReduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Extensive experimental evaluations for a wide variety of graph inputs and algorithms demonstrate that GraphReduce significantly outperforms other competing out-of-memory approaches.
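As a single-threaded illustration of the Gather-Apply-Scatter pattern that GraphReduce implements on GPUs, here is PageRank written in GAS style. The phases correspond loosely to the framework's gatherMap/gatherReduce, apply, and scatter device functions; everything else is an illustrative stand-in.

```python
# PageRank in Gather-Apply-Scatter form: gather/reduce accumulates
# contributions over in-edges, apply updates vertex state, and scatter
# would activate neighbors for the next iteration.

def gas_pagerank(edges, n, iters=20, d=0.85):
    rank = [1.0 / n] * n
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    for _ in range(iters):
        acc = [0.0] * n
        for u, v in edges:               # gather/reduce over in-edges
            acc[v] += rank[u] / out_deg[u]
        rank = [(1 - d) / n + d * a      # apply: update each vertex
                for a in acc]
        # scatter: all vertices stay active in this dense toy example
    return rank

print(gas_pagerank([(0, 1), (1, 2), (2, 0), (0, 2)], n=3))
```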
Citations: 89