2014 IEEE 28th International Parallel and Distributed Processing Symposium: Latest Papers

Petascale Application of a Coupled CPU-GPU Algorithm for Simulation and Analysis of Multiphase Flow Solutions in Porous Medium Systems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.67
J. McClure, Hao Wang, J. Prins, Cass T. Miller, Wu-chun Feng
Large-scale simulation can provide a wide range of information needed to develop and validate theoretical models for multiphase flow in porous medium systems. In this paper, we consider a coupled solution in which a multiphase flow simulator is coupled to an analysis approach used to extract the interfacial geometries as the flow evolves. This has been implemented using MPI to target heterogeneous nodes equipped with GPUs. The GPUs evolve the multiphase flow solution using the lattice Boltzmann method while the CPUs compute upscaled measures of the morphology and topology of the phase distributions and their rate of evolution. Our approach is demonstrated to scale to 4,096 GPUs and 65,536 CPU cores to achieve a maximum performance of 244,754 million-lattice-node updates per second (MLUPS) in double-precision execution on Titan. In turn, this approach increases the size of systems that can be considered by an order of magnitude compared with previous work and enables detailed in situ tracking of averaged flow quantities at temporal resolutions that were previously impossible. Furthermore, it virtually eliminates the need for post-processing and intensive I/O and mitigates the potential loss of data associated with node failures.
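To put the MLUPS figure in context, the short calculation below divides the aggregate rate quoted in the abstract by the number of GPUs. The `mlups` helper and the 512^3 example run are illustrative assumptions, not values from the paper.

```python
# Figures quoted in the abstract above.
TOTAL_MLUPS = 244_754   # aggregate million lattice-node updates per second
NUM_GPUS = 4_096


def mlups(lattice_nodes: int, time_steps: int, seconds: float) -> float:
    """Million lattice-node updates per second for a single run."""
    return lattice_nodes * time_steps / seconds / 1e6


# Average rate sustained by each GPU at full scale.
per_gpu = TOTAL_MLUPS / NUM_GPUS
print(f"per-GPU rate: {per_gpu:.1f} MLUPS")

# Hypothetical run (not from the paper): a 512^3 lattice, 100 steps, 10 s.
print(f"example run: {mlups(512**3, 100, 10.0):.1f} MLUPS")
```

The per-GPU value is only an average over the whole machine; it says nothing about how the rate varies with problem size or node count.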
Citations: 18
Using Multiple Threads to Accelerate Single Thread Performance
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.104
Zehra Sura, K. O'Brien, J. Brunheroto
Computing systems are being designed with an increasing number of hardware cores. To effectively use these cores, applications need to maximize the amount of parallel processing and minimize the time spent in sequential execution. In this work, we aim to exploit fine-grained parallelism beyond the parallelism already encoded in an application. We define an execution model using a primary core and some number of secondary cores that collaborate to speed up the execution of sequential code regions. This execution model relies on cores that are physically close to each other and have fast communication paths between them. For this purpose, we introduce dedicated hardware queues for low-latency transfer of values between cores, and define special "enque" and "deque" instructions to use the queues. Further, we develop compiler analyses and transformations to automatically derive fine-grained parallel code from sequential code regions. We implemented this model for exploiting fine-grained parallelization in the IBM XL compiler framework and in a simulator for the Blue Gene/Q system. We also studied the Sequoia benchmarks to determine code sections where our techniques are applicable. We evaluated our work using these code sections, and observed an average speedup of 1.32 on 2 cores, and an average speedup of 2.05 on 4 cores. Since these code sections are otherwise sequentially executed, we conclude that our approach is useful for accelerating single thread performance.
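The primary/secondary execution model described above can be sketched in software. The example below is only a structural analogy: it uses Python's `queue.Queue` as a stand-in for the dedicated hardware queues, and the loop body, the stage split, and the function names are assumptions made for illustration. It will not run faster in CPython; it only shows how a sequential loop is split into two cooperating parts.

```python
import queue
import threading


def sequential(values):
    """Reference: the original sequential loop."""
    total = 0
    for v in values:
        t = v * v + 1      # stage 1: independent per-iteration work
        total += t % 7     # stage 2: loop-carried accumulation
    return total


def decoupled(values):
    """Same loop, with stage 1 on a secondary thread and stage 2 on the primary."""
    q = queue.Queue(maxsize=64)   # software stand-in for a hardware queue
    done = object()               # sentinel marking the end of the stream

    def secondary():
        for v in values:
            q.put(v * v + 1)      # plays the role of "enque"
        q.put(done)

    worker = threading.Thread(target=secondary)
    worker.start()

    total = 0
    while True:
        t = q.get()               # plays the role of "deque"
        if t is done:
            break
        total += t % 7
    worker.join()
    return total


data = list(range(1000))
print(sequential(data), decoupled(data))
```

For reference, the speedups reported in the abstract correspond to parallel efficiencies of 1.32/2 = 0.66 on 2 cores and 2.05/4 ≈ 0.51 on 4 cores.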
Citations: 3
EDM: An Endurance-Aware Data Migration Scheme for Load Balancing in SSD Storage Clusters
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.86
Jiaxin Ou, J. Shu, Youyou Lu, Letian Yi, Wei Wang
Data migration schemes are critical to balance the load in storage clusters for performance improvement. However, as NAND flash-based SSDs are widely deployed in storage systems, extending the lifespan of SSD storage clusters becomes a new challenge for data migration. Prior approaches designed for HDD storage clusters are inefficient due to excessive write amplification during data migration, which significantly decreases the lifespan of SSD storage clusters. To overcome this problem, we propose EDM, an endurance-aware data migration scheme with careful data placement and movement to minimize the data migrated, so as to limit the wear-out of SSDs while improving the performance. Based on the observation that performance degradation is dominated by the wear speed of an SSD, which is affected by both the storage utilization and the write intensity, two complementary data migration policies are designed to explore the trade-offs among throughput, response time during migration, and lifetime of SSD storage clusters. Moreover, we design an SSD wear model and quantitatively calculate the amount of data migrated as well as the sources and destinations of the migration, so as to reduce the write amplification caused by migration. Results on a real storage cluster using real-world traces show that EDM performs favorably versus existing HDD-based migration techniques, reducing cluster-wide aggregate erase count by up to 40%. At the same time, it improves performance by 25% on average compared to the baseline system, a gain almost as large as that achieved by previous migration techniques.
Citations: 35
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.122
S. Di, M. Bouguerra, L. Bautista-Gomez, F. Cappello
The HPC community projects that future extreme-scale systems will be much less stable than current petascale systems, thus requiring sophisticated fault tolerance to guarantee the completion of large-scale numerical computations. Execution failures may occur due to multiple factors with different scales, from transient uncorrectable memory errors localized in processes to massive system outages. Multi-level checkpoint/restart is a promising model that provides an elastic response to tolerate different types of failures. It stores checkpoints at different levels: e.g., local memory, remote memory, using a software RAID, local SSD, remote file system. In this paper, we respond to two open questions: 1) how to optimize the selection of checkpoint levels based on failure distributions observed in a system, 2) how to compute the optimal checkpoint intervals for each of these levels. The contribution is three-fold. (1) We build a mathematical model to fit the multi-level checkpoint/restart mechanism with large-scale applications regarding various types of failures. (2) We theoretically optimize the entire execution performance for each parallel application by selecting the best checkpoint level combination and corresponding checkpoint intervals. (3) We characterize checkpoint overheads on different checkpoint levels in a real cluster environment, and evaluate our optimal solutions using both simulation with millions of cores and a real environment with real-world MPI programs running on hundreds of cores. Experiments show that optimized selections of levels associated with optimal checkpoint intervals at each level outperform other state-of-the-art solutions by 5-50 percent.
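As background for the question of optimal checkpoint intervals, the sketch below applies Young's classic first-order approximation, interval = sqrt(2 × checkpoint cost × mean time between failures), to each level independently. This is not the model proposed in the paper, which optimizes the levels jointly; the per-level costs and failure rates are made-up values chosen purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical levels: (checkpoint cost in seconds, MTBF in seconds of the
# failures that this level is meant to recover from). Not measured data.
levels = {
    "local memory":       (1.0,   3_600.0),
    "remote memory":      (5.0,  14_400.0),
    "remote file system": (60.0, 86_400.0),
}

for name, (cost, mtbf) in levels.items():
    interval = young_interval(cost, mtbf)
    print(f"{name:>18}: checkpoint every {interval:8.1f} s")
```

Cheaper levels that guard against frequent failures end up with short intervals, while the expensive file-system level is used rarely. Treating the levels independently ignores their interactions, which is the gap a joint model is meant to close.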
Citations: 92
Designing Bit-Reproducible Portable High-Performance Applications
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.127
Andrea Arteaga, O. Fuhrer, T. Hoefler
Bit-reproducibility has many advantages in the context of high-performance computing. Besides making debugging and testing simpler and more accurate, it can allow the deployment of applications on heterogeneous systems while maintaining the consistency of the computations. In this work we analyze the basic operations performed by scientific applications and identify the possible sources of non-reproducibility. In particular, we consider the tasks of evaluating transcendental functions and performing reductions using non-associative operators. We present a set of techniques to achieve reproducibility and we propose improvements over existing algorithms to perform reproducible computations in a portable way, while obtaining good performance and accuracy. By applying these techniques to more complex tasks we show that bit-reproducibility can be achieved on a broad range of scientific applications.
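The non-associativity mentioned above is easy to reproduce with floating-point addition. The sketch below sums the same four values in two orders with a plain loop and gets different results, then uses Python's `math.fsum`, which returns an exactly rounded sum and is therefore independent of the order. `math.fsum` is used here only as a readily available stand-in; it is not the technique proposed in the paper.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point accumulation, as in a simple reduction."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers in two different orders, as could happen when a
# parallel reduction combines partial results in a different sequence.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]

print(naive_sum(order_a), naive_sum(order_b))   # results differ
print(math.fsum(order_a), math.fsum(order_b))   # results agree
```

In the first ordering, adding 1.0 to 1e16 is lost to rounding, so the naive result depends on the order of operations even though the mathematical sum does not.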
Citations: 28
Runtime-Guided Cache Coherence Optimizations in Multi-core Architectures
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.71
M. Manivannan, P. Stenström
Emerging task-based parallel programming models shield programmers from the daunting task of parallelism management by delegating the responsibility of mapping and scheduling of individual tasks to the runtime system. The runtime system can use semantic information about task dependencies supplied by the programmer and the mapping information of tasks to enable optimizations like data-flow based execution and locality-aware scheduling of tasks. However, should the cache coherence substrate have access to this information from the runtime system, it would enable aggressive optimizations of prevailing access patterns such as one-to-many producer-consumer sharing and migratory sharing. Such linkage, however, has not been studied before. We present a family of runtime-guided cache coherence optimizations enabled by linking dependency and mapping information from the runtime system to the cache coherence substrate. By making this information available to the cache coherence substrate, we show that optimizations, such as downgrading and self-invalidation, that help reduce overheads associated with producer-consumer and migratory sharing can be supported with reasonable extensions to the baseline cache coherence protocol. Our experimental results establish that each optimization provides significant performance gain in isolation and can provide additional gains when combined. Finally, we evaluate these optimizations in the context of earlier proposed runtime-guided prefetching schemes and show that they can have synergistic effects.
Citations: 21
MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.88
Yang You, S. Song, H. Fu, A. Márquez, M. Dehnavi, K. Barker, K. Cameron, A. Randles, Guangwen Yang
Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach increasing importance to analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies which can make runtime training possible, but form a barrier to efficient parallel SVM design. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups against the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM, run on a top-of-the-line NVIDIA K20X GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis, focusing on Ivy Bridge CPUs, MIC and GPUs, and provide insights on how to select the most suitable advanced architectures for specific algorithms and input data patterns.
Citations: 41
Traversing Trillions of Edges in Real Time: Graph Exploration on Large-Scale Parallel Machines
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.52
Fabio Checconi, F. Petrini
The world of Big Data is changing dramatically right before our eyes, from the amount of data being produced to the way in which it is structured and used. The trend of "big data growth" presents enormous challenges, but it also presents incredible scientific and business opportunities. Together with the data explosion, we are also witnessing a dramatic increase in data processing capabilities, thanks to new powerful parallel computer architectures and more sophisticated algorithms. In this paper we describe the algorithmic design and the optimization techniques that led to the unprecedented processing rate of 15.3 trillion edges per second on 64 thousand Blue Gene/Q nodes, which allowed the in-memory exploration of a petabyte-scale graph in just a few seconds. This paper provides insight into our parallelization and optimization techniques. We believe that these techniques can be successfully applied to a broader class of graph algorithms.
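The abstract does not name the traversal algorithm, so the sketch below assumes a level-synchronous breadth-first search, a common formulation of graph exploration, and shows how an edges-per-second rate can be measured on a single machine. The tiny adjacency list and the function name `bfs_levels` are illustrative assumptions; none of the paper's distributed optimizations are represented.

```python
import time


def bfs_levels(adj, source):
    """Level-synchronous BFS; returns vertex levels and the number of edges scanned."""
    level = {source: 0}
    frontier = [source]
    edges_scanned = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                edges_scanned += 1
                if v not in level:          # first visit fixes the level
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier            # advance one level at a time
    return level, edges_scanned


# Small undirected example graph; vertex 4 is unreachable from vertex 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}

start = time.perf_counter()
level, edges = bfs_levels(adj, 0)
elapsed = time.perf_counter() - start

print(level, edges)
print(f"{edges / elapsed:.0f} edges scanned per second")
```

Processing one frontier at a time is what makes this formulation attractive for parallel machines: all vertices in a frontier can be expanded concurrently, with synchronization only between levels.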
Citations: 69
Overcoming the Limitations Posed by TCR-beta Repertoire Modeling through a GPU-Based In-Silico DNA Recombination Algorithm
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.34
Gregory M. Striemer, Harsha Krovi, A. Akoglu, B. Vincent, Benjamin Hopson, J. Frelinger, Adam Buntzman
The DNA recombination process known as V(D)J recombination is the central mechanism for generating diversity among antigen receptors such as T-cell receptors (TCRs). This diversity is crucial for the development of the adaptive immune system. However, modeling of all the αβ TCR sequences is encumbered by the enormity of the potential repertoire, which has been predicted to exceed 10^15 sequences. Prior modeling efforts have, therefore, been limited to extrapolations based on the analysis of minor subsets of the overall TCRbeta repertoire. In this study, we map the recombination process completely onto the graphics processing unit (GPU) hardware architecture using the CUDA programming environment to circumvent prior limitations. For the first time, we present a model of the mouse TCRbeta repertoire to an extent that enabled us to evaluate the Convergent Recombination Hypothesis (CRH) comprehensively at the peta-scale level on a single GPU.
Citations: 3
Locating Parallelization Potential in Object-Oriented Data Structures
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.106
Korbinian Molitorisz, Thomas Karcher, Alexander Biele, W. Tichy
The free lunch of ever-increasing single-processor performance is over. Software engineers have to parallelize software to gain performance improvements. But not every software engineer is a parallel programming expert, and with millions of lines of code that have not been developed with multicore in mind, we have to find ways to assist in identifying parallelization potential. This paper makes three contributions: 1) An empirical study of more than 900,000 lines of code reveals five use cases in the runtime profile of object-oriented data structures that carry parallelization potential. 2) The study also points out frequently used data structures in realistic software in which these use cases can be found. 3) We developed DSspy, an automatic dynamic profiler that locates these use cases and makes recommendations on how to parallelize them. Our evaluation shows that DSspy reduces the search space for parallelization by up to 77%, so engineers only need to consider 23% of all data structure instances for parallelization.
Citations: 2