
2014 IEEE 28th International Parallel and Distributed Processing Symposium: Latest Publications

Parallel Mutual Information Based Construction of Whole-Genome Networks on the Intel(R) Xeon Phi(TM) Coprocessor
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.35
Sanchit Misra, K. Pamnany, S. Aluru
Construction of whole-genome networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, most cannot handle network reconstruction at the whole-genome scale, and the few that can require large clusters. In this paper, we present a solution on the Intel(R) Xeon Phi(TM) coprocessor, taking advantage of its multi-level parallelism, including many x86-based cores, multiple threads per core, and vector processing units. We also present a solution on the Intel(R) Xeon(R) processor. Our solution is based on TINGe, a fast parallel network reconstruction technique that uses mutual information and permutation testing to assess statistical significance. We demonstrate the first-ever inference of a plant whole-genome regulatory network on a single chip by constructing a 15,575-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in only 22 minutes. In addition, our optimizations for parallelizing mutual information computation on the Intel Xeon Phi coprocessor offer lessons applicable to other domains.
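The kernel that TINGe parallelizes, pairwise mutual information scored against a permutation null, can be sketched in a few lines. This is an illustrative histogram-based sketch only; the function names, bin count, and permutation count are our own choices, not the paper's implementation:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) between two expression profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y (row vector)
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Significance of the observed MI under a permutation null."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(x, y)
    null = [mutual_information(rng.permutation(x), y) for _ in range(n_perm)]
    return (1 + sum(m >= observed for m in null)) / (n_perm + 1)
```

An edge is kept in the network only if its MI is significant under the permutation test; at whole-genome scale this means billions of such pairwise evaluations, which is what makes the computation worth vectorizing and threading.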
Citations: 13
s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.119
Samuel Williams, M. Lijewski, A. Almgren, B. V. Straalen, E. Carson, Nicholas Knight, J. Demmel
Geometric multigrid solvers within adaptive mesh refinement (AMR) applications often reach a point where further coarsening of the grid becomes impractical as individual subdomain sizes approach unity. At this point, the most common solution is to use a bottom solver, such as BiCGStab, to reduce the residual by a fixed factor at the coarsest level. Each iteration of BiCGStab requires multiple global reductions (MPI collectives). As the number of BiCGStab iterations required for convergence grows with problem size, and the time for each collective operation increases with machine scale, bottom solves in large-scale applications can constitute a significant fraction of the overall multigrid solve time. In this paper, we implement, evaluate, and optimize a communication-avoiding s-step formulation of BiCGStab (CABiCGStab for short) as a high-performance, distributed-memory bottom solver for geometric multigrid solvers. This is the first time s-step Krylov subspace methods have been leveraged to improve multigrid bottom solver performance. We use a synthetic benchmark for detailed analysis and integrate the best implementation into BoxLib in order to evaluate the benefit of an s-step Krylov subspace method on the multigrid solves found in the applications LMC and Nyx on up to 32,768 cores of the Cray XE6 at NERSC. Overall, we see bottom solver improvements of up to 4.2x on synthetic problems and up to 2.7x in real applications. This translates into as much as a 1.5x improvement in overall solver performance in real applications.
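To see why bottom solves are collective-heavy, consider textbook BiCGStab: every vector inner product below would be a global reduction (e.g. MPI_Allreduce) in a distributed-memory setting, and these are precisely the operations an s-step reformulation batches into fewer collectives. A minimal serial sketch of the classical method (not the paper's CABiCGStab):

```python
import numpy as np

def bicgstab(A, b, tol=1e-10, max_iter=1000):
    # Textbook BiCGStab for Ax = b. Each dot product marked below is one
    # global reduction per iteration in a distributed-memory bottom solver.
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    r_hat = r.copy()                    # fixed shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(x)
    p = np.zeros_like(x)
    for _ in range(max_iter):
        rho_new = r_hat @ r             # reduction 1
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho / (r_hat @ v)       # reduction 2
        s = r - alpha * v
        if np.linalg.norm(s) < tol:     # reduction 3
            x = x + alpha * p
            break
        t = A @ s
        omega = (t @ s) / (t @ t)       # reductions 4 and 5
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) < tol:     # reduction 6
            break
    return x
```

With several reductions per iteration and iteration counts growing with problem size, the latency of collectives at scale dominates; batching s iterations' worth of inner products into one collective is the communication-avoiding idea.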
Citations: 29
New Effective Multithreaded Matching Algorithms
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.61
F. Manne, M. Halappanavar
Matching is an important combinatorial problem with applications in areas such as community detection, sparse linear algebra, and network alignment. Since computing optimal matchings can be very time consuming, several fast approximation algorithms, both sequential and parallel, have been suggested. The algorithms that give the best solutions tend to be sequential by nature, while algorithms better suited to parallel computation give solutions of lower quality. We present a new, simple 1/2-approximation algorithm for the weighted matching problem. This algorithm is faster than any other suggested sequential 1/2-approximation algorithm on almost all inputs, and when parallelized it also scales better than previous multithreaded algorithms. We further extend it to a general scalable multithreaded algorithm that computes matchings of weight comparable with the best sequential deterministic algorithms. The performance of the suggested algorithms is documented through extensive experiments on different multithreaded architectures.
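For reference, the classic global greedy algorithm achieves the same 1/2-approximation guarantee; it is the inherently sequential baseline that multithreaded matching algorithms aim to beat. A minimal sketch (this is the textbook greedy, not the paper's new algorithm):

```python
def greedy_matching(edges):
    """Global greedy 1/2-approximation for weighted matching: scan edges
    in non-increasing weight order, taking an edge iff both endpoints
    are still free. Inherently sequential due to the global sort order."""
    matched = set()
    matching = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching
```

On the path a-b-c-d with weights 3, 4, 3, greedy takes only the middle edge (weight 4) versus the optimal 6, exactly the worst-case 1/2 bound; parallel variants instead match locally dominant edges concurrently to avoid the global scan.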
Citations: 45
LFTI: A New Performance Metric for Assessing Interconnect Designs for Extreme-Scale HPC Systems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.38
Xin Yuan, S. Mahapatra, M. Lang, S. Pakin
Traditionally, interconnect performance is either characterized by simple topological parameters, such as bisection bandwidth, or studied through simulation, which gives detailed performance information for the scenarios simulated. Neither approach provides a good performance overview for extreme-scale interconnects: topological parameters are not directly related to application-level communication performance, while simulation complexity limits the number of scenarios that can be investigated. In this work, we propose a new performance metric, called LANL-FSU Throughput Indices (LFTI), for characterizing the throughput performance of interconnect designs. LFTI combines the simplicity of topological parameters with the accuracy of simulation: like topological parameters, LFTI can be derived from the interconnect specification; at the same time, it directly reflects application-level communication performance. Moreover, when the theoretical throughput of each communication pattern can be modeled efficiently for an interconnect, its LFTI can be computed efficiently. These features potentially allow LFTI to be used for rapid and comprehensive evaluation and comparison of extreme-scale interconnect designs. We demonstrate the effectiveness of LFTI by using it to evaluate and explore the design space of a number of large-scale interconnect designs.
Citations: 19
Computational Co-design of a Multiscale Plasma Application: A Process and Initial Results
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.114
J. Payne, D. Knoll, A. McPherson, W. Taitano, L. Chacón, Guangye Chen, S. Pakin
As computer architectures become increasingly heterogeneous, the need for algorithms and applications that can exploit these new architectures grows more pressing. This paper demonstrates that co-designing a multi-architecture, multi-scale, highly optimized framework together with its associated plasma-physics application can provide both portability across CPUs and accelerators and high performance. Our framework utilizes multiple abstraction layers to maximize code reuse between architectures while providing low-level abstractions that incorporate architecture-specific optimizations such as vectorization or hardware fused multiply-add. We describe a co-design process used to enable a plasma physics application to scale well to large systems while also improving both the accuracy and speed of the simulations. Optimized multi-core results are presented to demonstrate the ability to isolate large amounts of computational work with minimal communication.
Citations: 1
RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.102
Florin Dinu, T. Ng
Data replication, the main failure resilience strategy used for big data analytics jobs, can be unnecessarily inefficient. It can cause serious performance degradation when applied to intermediate job outputs in multi-job computations. For instance, for I/O-intensive big data jobs, data replication is especially expensive because very large datasets need to be replicated. Reducing the number of replicas is not a satisfactory solution as it only aggravates a fundamental limitation of data replication: its failure resilience guarantees are limited by the number of available replicas. When all replicas of some piece of intermediate job output are lost, cascading job recomputations may be required for recovery. In this paper we show how job recomputation can be made a first-order failure resilience strategy for big data analytics. The need for data replication can thus be significantly reduced. We present RCMP, a system that performs efficient job recomputation. RCMP can persist task outputs across jobs and leverage them to minimize the work performed during job recomputations. More importantly, RCMP addresses two important challenges that appear during job recomputations. The first is efficiently utilizing the available compute node parallelism. The second is dealing with hot-spots. RCMP handles both by switching to a finer-grained task scheduling granularity for recomputations. Our experiments show that RCMP's benefits hold across two different clusters, for job inputs as small as 40GB or as large as 1.2TB. Compared to RCMP, data replication is 30%-100% worse during failure-free periods. More importantly, by efficiently performing recomputations, RCMP is comparable or better even under single and double data loss events.
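The idea of reusing persisted task outputs to bound the work done during recomputation can be sketched with a toy lineage graph. This is a conceptual illustration only; the names and structure are our own, not RCMP's API:

```python
def recompute(lineage, cache, target):
    """Return target's output, reusing any persisted outputs in `cache`
    and recursively recomputing only what was lost.

    lineage maps task -> (function, list of input task names);
    cache maps task -> its persisted output, if it survived the failure.
    """
    if target in cache:
        return cache[target]
    fn, deps = lineage[target]
    args = [recompute(lineage, cache, d) for d in deps]
    cache[target] = fn(*args)
    return cache[target]
```

Because any surviving intermediate output cuts off the recursion, a failure only triggers recomputation of the lost subtree rather than the whole job chain, which is why persisting task outputs across jobs pays off.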
Citations: 19
ReDHiP: Recalibrating Deep Hierarchy Prediction for Energy Efficiency
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.98
Xun Li, D. Franklin, R. Bianchini, F. Chong
Recent hardware trends point to increasingly deep cache hierarchies. In such hierarchies, accesses that look up and miss in every cache level incur significant energy consumption and degraded performance. To mitigate these problems, in this paper we propose Recalibrating Deep Hierarchy Prediction (ReDHiP), an architectural mechanism that predicts last-level cache (LLC) misses in advance: for a predicted LLC miss, no cache level needs to be accessed at all. Our design for ReDHiP centers on a simple, compact prediction table that can be efficiently recalibrated over time. We find that a simpler scheme, while sacrificing some accuracy, can be more accurate per bit than more complex schemes thanks to recalibration. Our evaluation shows that ReDHiP achieves an average of 22% cache energy savings and 8% performance improvement across a wide range of benchmarks, at a hardware cost of less than 1% of the LLC. We also demonstrate how ReDHiP can be used to reduce the energy overhead of hardware data prefetching while further improving performance.
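As a conceptual illustration, a compact miss-prediction table with periodic recalibration might look like the following sketch. The indexing scheme, 2-bit saturating counters, and reset policy here are our own illustrative choices, not ReDHiP's published design:

```python
class MissPredictor:
    """Toy LLC-miss predictor: a small table of 2-bit saturating
    counters indexed by a hash of the access address/PC. A counter
    value >= 2 predicts a miss (so all cache lookups could be skipped);
    periodic recalibration resets counters so stale phases fade out."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [1] * entries      # start in the weak "hit" state

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict_miss(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, missed):
        """Train on the actual LLC outcome of the access."""
        i = self._index(pc)
        if missed:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

    def recalibrate(self):
        """Periodic reset toward the neutral state."""
        self.table = [1] * self.entries
```

The point of the sketch is the trade-off the abstract describes: a tiny table mispredicts more often per lookup, but its per-bit accuracy and cheap recalibration can beat a larger, more complex structure.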
Citations: 1
How Well Do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.49
Yong Guo, M. Biczak, A. Varbanescu, A. Iosup, Claudio Martella, Theodore L. Willke
Graph-processing platforms are increasingly used in a variety of domains. Although both industry and academia are developing and tuning graph-processing algorithms and platforms, the performance of graph-processing platforms has never been explored or compared in depth. Thus, users face the daunting challenge of selecting an appropriate platform for their specific application. To alleviate this challenge, we propose an empirical method for benchmarking graph-processing platforms. We define a comprehensive process and a selection of representative metrics, datasets, and algorithmic classes. We implement a benchmarking suite of five classes of algorithms and seven diverse graphs. Our suite reports on basic (user-level) performance, resource utilization, scalability, and various overheads. We use our benchmarking suite to analyze and compare six platforms, gaining valuable insights for each and presenting the first comprehensive comparison of graph-processing platforms.
Citations: 118
Remote Invalidation: Optimizing the Critical Path of Memory Transactions
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.30
Ahmed Hassan, R. Palmieri, B. Ravindran
Software Transactional Memory (STM) systems are increasingly emerging as a promising alternative to traditional locking algorithms for implementing generic concurrent applications. To achieve generality, STM systems incur overheads to the normal sequential execution path, including those due to spin locking, validation (or invalidation), and commit/abort routines. We propose a new STM algorithm called Remote Invalidation (or RInval) that reduces these overheads and improves STM performance. RInval's main idea is to execute commit and invalidation routines on remote server threads that run on dedicated cores, and use cache-aligned communication between application's transactional threads and the server routines. By remote execution of commit and invalidation routines and cache-aligned communication, RInval reduces the overhead of spin locking and cache misses on shared locks. By running commit and invalidation on separate cores, they become independent of each other, increasing commit concurrency. We implemented RInval in the Rochester STM framework. Our experimental studies on micro-benchmarks and the STAMP benchmark reveal that RInval outperforms InvalSTM, the corresponding non-remote invalidation algorithm, by as much as an order of magnitude. Additionally, RInval obtains competitive performance to validation-based STM algorithms such as NOrec, yielding up to 2x performance improvement.
Citations: 9
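RInval's core idea, delegating commit routines to a dedicated server thread so application threads never contend on a shared spin lock, can be sketched in miniature. The code below is an illustrative analogue, not the authors' RSTM-based implementation: a `queue.Queue` stands in for the paper's cache-aligned communication channel, and commit closures run serially on the server thread.

```python
# Illustrative analogue of RInval's remote-commit idea (not the authors'
# code): transactional threads submit their commit routine to a server
# thread instead of acquiring a shared lock themselves.
import queue
import threading

class CommitServer:
    def __init__(self):
        self.requests = queue.Queue()   # stand-in for the cache-aligned channel
        self.counter = 0                # shared state, touched only by the server
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        while True:
            commit_fn, done = self.requests.get()
            if commit_fn is None:       # shutdown sentinel
                break
            commit_fn(self)             # commits run serially: no lock needed
            done.set()

    def commit(self, commit_fn):
        done = threading.Event()
        self.requests.put((commit_fn, done))
        done.wait()                     # transactional thread waits for its commit

    def stop(self):
        self.requests.put((None, None))
        self.thread.join()

server = CommitServer()

def increment(srv):                     # a trivial "commit routine"
    srv.counter += 1

workers = [threading.Thread(target=lambda: [server.commit(increment) for _ in range(100)])
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
server.stop()
print(server.counter)  # 400: all commits applied, none lost to races
```

Because only the server thread mutates `counter`, the increments need no lock; the trade-off, as in the paper, is the cost of the request/response hand-off versus contention on a shared lock.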
Analytically Modeling Application Execution for Software-Hardware Co-design
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.56
Jichi Guo, Jiayuan Meng, Qing Yi, V. Morozov, Kalyan Kumaran
Software-hardware co-design has become increasingly important as the scale and complexity of both are reaching an unprecedented level. To predict and understand application behavior on emerging or conceptual systems, existing research has mostly relied on cycle-accurate micro-architecture simulators, which are known to be time-consuming and are oblivious to workloads' control flow structure. As a result, simulations are often limited to small kernels, and the first step in the co-design process is often to extract important kernels, construct mini-applications, and identify potential hardware limitations. This requires a high-level understanding of the full application's potential behavior on a future system, e.g. the most time-consuming regions, the performance bottlenecks for these regions, etc. Unfortunately, such application knowledge gained from one system may not hold true on a future system. One solution is to instrument the full application with timers and simulate it with a reasonable input size, which can be a daunting task in itself. We propose an alternative approach to gain first-order insights into hardware-dependent application behavior by trading off the accuracy of analysis for improved efficiency. By modeling the execution flows of user applications and analyzing it using target hardware's performance models, our technique requires no cycle-accurate simulation on a prospective system. In fact, our technique's analysis time does not increase with the input data size.
Citations: 6
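The approach above replaces cycle-accurate simulation with analytical models of execution flow evaluated against hardware performance models. A first-order sketch in that spirit (not the authors' tool) is a roofline-style estimate: each modeled region is bounded by either compute throughput or memory bandwidth, and region times are summed over the execution flow. All hardware numbers and region costs below are hypothetical.

```python
# First-order analytical estimate in the spirit of the paper (not the
# authors' tool): bound each region's time by compute throughput or
# memory bandwidth (a simple roofline), then sum over the modeled flow.
def region_time(flops, bytes_moved, peak_gflops, bandwidth_gbs):
    compute_s = flops / (peak_gflops * 1e9)
    memory_s = bytes_moved / (bandwidth_gbs * 1e9)
    return max(compute_s, memory_s)    # the slower resource dominates

def predict(regions, peak_gflops, bandwidth_gbs):
    per_region = {}
    for name, (trips, flops, byts) in regions.items():
        per_region[name] = trips * region_time(flops, byts,
                                               peak_gflops, bandwidth_gbs)
    total = sum(per_region.values())
    hotspot = max(per_region, key=per_region.get)
    return total, hotspot

# Hypothetical application: two regions with per-trip flop/byte counts.
regions = {
    "stencil": (1000, 2.0e6, 8.0e6),   # memory-bound region
    "dgemm":   (10,   1.0e9, 1.0e7),   # compute-bound region
}
total, hotspot = predict(regions, peak_gflops=100.0, bandwidth_gbs=50.0)
print(hotspot)  # stencil: 0.16 s modeled vs 0.1 s for dgemm
```

Like the paper's technique, the cost of this analysis depends on the number of modeled regions, not on the input data size, since regions are summarized by counts rather than executed.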
Journal
2014 IEEE 28th International Parallel and Distributed Processing Symposium