
2014 IEEE 28th International Parallel and Distributed Processing Symposium: Latest Publications

Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.40
Jieming Yin, Pingqiang Zhou, S. Sapatnekar, Antonia Zhai
NoCs are an integral part of modern multicore processors; as system size scales up, they must continuously support high-throughput, low-latency on-chip data communication under a stringent energy budget. Heterogeneous multicore systems push the limits of NoC design further by integrating cores with diverse performance requirements onto the same die. Traditional packet-switched NoCs offer the flexibility to connect diverse computation and storage devices, but the latency and energy consumed by buffering and routing at each router make it difficult for them to meet performance requirements within the energy budget. In this paper, we take advantage of the diversity in the performance requirements of on-chip heterogeneous computing devices by designing, implementing, and evaluating a hybrid-switched network that allows packet-switched and circuit-switched messages to share the same communication fabric, partitioning the network through time-division multiplexing (TDM). In the proposed hybrid-switched network, circuit-switched paths are established along frequently communicating nodes. Our experiments show that utilizing these paths improves system performance by reducing communication latency and alleviating network congestion. Furthermore, reducing buffering in routers enables aggressive power gating, yielding better energy efficiency.
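The circuit-switched paths ride on per-router TDM slot tables. Below is a minimal sketch of how such a reservation might work; the slot count, port encoding, and the one-slot-per-hop pipelining rule are illustrative assumptions, not the paper's design.

```cpp
// Illustrative TDM slot reservation for a circuit-switched path.
// Slot/port counts and the reservation rule are assumptions.
#include <array>
#include <cstdio>
#include <vector>

constexpr int kSlots = 8;   // TDM slots per frame (assumed)
constexpr int kPorts = 5;   // N, E, S, W, local
constexpr int kFree  = -1;

struct Router {
    // slot_table[s][in] = output port reserved for time slot s, or kFree
    std::array<std::array<int, kPorts>, kSlots> slot_table{};
    Router() { for (auto& row : slot_table) row.fill(kFree); }
};

struct Hop { int router, in_port, out_port; };

// Reserve one slot per hop, advancing the slot each hop so a flit arriving
// at hop i in slot (start + i) finds its output port pre-reserved.
bool reserve_path(std::vector<Router>& net, const std::vector<Hop>& path,
                  int start_slot) {
    for (size_t i = 0; i < path.size(); ++i) {
        int s = (start_slot + static_cast<int>(i)) % kSlots;
        if (net[path[i].router].slot_table[s][path[i].in_port] != kFree)
            return false;  // conflict: message falls back to packet switching
    }
    for (size_t i = 0; i < path.size(); ++i) {
        int s = (start_slot + static_cast<int>(i)) % kSlots;
        net[path[i].router].slot_table[s][path[i].in_port] = path[i].out_port;
    }
    return true;
}

int main() {
    std::vector<Router> net(4);
    std::vector<Hop> path = {{0, 4, 1}, {1, 3, 1}, {2, 3, 4}};
    std::printf("reserved: %s\n", reserve_path(net, path, 0) ? "yes" : "no");
}
```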
Citations: 35
A New Scalable Parallel Algorithm for Fock Matrix Construction
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.97
Xing Liu, Aftab Patel, Edmond Chow
Hartree-Fock (HF) or self-consistent field (SCF) calculations are widely used in quantum chemistry, and are the starting point for accurate electronic correlation methods. Existing algorithms and software, however, may fail to scale to large numbers of cores on a distributed machine, particularly when simulating moderately-sized molecules. In existing codes, HF calculations are divided into tasks. Fine-grained tasks are better for load balance, but coarse-grained tasks require less communication. In this paper, we present a new parallelization of HF calculations that addresses this trade-off: we use fine-grained tasks to balance the computation among large numbers of cores, but we also use a scheme for assigning tasks to processes that reduces communication. We specifically focus on the distributed construction of the Fock matrix arising in the HF algorithm, and describe the data access patterns in detail. For our test molecules, our implementation shows better scalability than NWChem for constructing the Fock matrix.
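To illustrate the locality side of that trade-off, the sketch below assigns fine-grained Fock-matrix block tasks to processes with a simple 2D mapping, so each process's tasks touch a bounded set of block rows and columns. The block counts and the owner function are assumptions for illustration, not the paper's actual assignment scheme.

```cpp
// Sketch: fine-grained Fock-matrix block tasks assigned to processes so
// that each process mostly touches blocks it already owns (less data moved).
#include <cstdio>
#include <vector>

struct Task { int bi, bj; };  // one block (bi, bj) of the Fock matrix

// Locality-aware owner: processes form a pr x pc grid; a task goes to the
// grid cell owning its block row/column (a 2D block-cyclic mapping).
int owner(int bi, int bj, int pr, int pc) {
    return (bi % pr) * pc + (bj % pc);
}

int main() {
    const int nblocks = 6, pr = 2, pc = 2;         // assumed sizes
    std::vector<std::vector<Task>> queue(pr * pc); // per-process task queues
    for (int bi = 0; bi < nblocks; ++bi)
        for (int bj = bi; bj < nblocks; ++bj)      // matrix symmetry: bj >= bi
            queue[owner(bi, bj, pr, pc)].push_back({bi, bj});
    for (int p = 0; p < pr * pc; ++p)
        std::printf("process %d: %zu tasks\n", p, queue[p].size());
}
```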
Citations: 23
Enabling In-Situ Data Analysis for Large Protein-Folding Trajectory Datasets
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.33
Boyu Zhang, Trilce Estrada, Pietro Cicotti, M. Taufer
This paper presents a one-pass, distributed method that enables in-situ data analysis for large protein-folding trajectory datasets by executing sufficiently fast, avoiding movement of trajectory data, and limiting memory usage. First, the method extracts the geometric shape features of each protein conformation in parallel. Then, it classifies sets of consecutive conformations into meta-stable and transition stages using a probabilistic hierarchical clustering method. Lastly, it rebuilds the global knowledge necessary for the intra- and inter-trajectory analysis through a reduction operation. A comparison of our method with a traditional approach on a villin headpiece subdomain shows that our method yields significant improvements in execution time, memory usage, and data movement. Specifically, to analyze the same trajectory consisting of 20,000 protein conformations, our method runs in 41.5 seconds while the traditional approach takes approximately 3 hours; it uses 6.9 MB of memory per core while the traditional method uses 16 GB on the single node where the analysis is performed; and it communicates only 4.4 KB while the traditional method moves the entire 539 MB dataset. The overall results in this paper support our claim that our method is suitable for in-situ data analysis of folding trajectories.
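The one-pass step can be pictured as computing a tiny per-frame descriptor and discarding the frame immediately, which is what bounds memory. Below is a sketch under that assumption, using radius of gyration as a stand-in for the paper's geometric shape features.

```cpp
// One-pass sketch: extract a small feature per conformation and drop the
// frame, so memory stays bounded by the feature stream, not the trajectory.
#include <cmath>
#include <cstdio>
#include <vector>

struct Atom { double x, y, z; };

double radius_of_gyration(const std::vector<Atom>& frame) {
    double cx = 0, cy = 0, cz = 0;
    for (const auto& a : frame) { cx += a.x; cy += a.y; cz += a.z; }
    const double n = static_cast<double>(frame.size());
    cx /= n; cy /= n; cz /= n;
    double s = 0;
    for (const auto& a : frame)
        s += (a.x - cx) * (a.x - cx) + (a.y - cy) * (a.y - cy) +
             (a.z - cz) * (a.z - cz);
    return std::sqrt(s / n);
}

int main() {
    std::vector<double> features;      // tiny compared to the trajectory
    for (int t = 0; t < 3; ++t) {      // stand-in for streaming frames in
        std::vector<Atom> frame = {{0, 0, 0}, {1.0 + t, 0, 0}, {0, 1, 0}};
        features.push_back(radius_of_gyration(frame));  // frame freed here
    }
    for (double f : features) std::printf("%.3f\n", f);
}
```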
Citations: 17
Performance and Energy Analysis of the Restricted Transactional Memory Implementation on Haswell
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.70
Bhavishya Goel, J. Gil, A. Negi, S. Mckee, P. Stenström
Hardware transactional memory implementations are becoming increasingly available. For instance, the Intel Core i7 4770 implements Restricted Transactional Memory (RTM) support for Intel Transactional Synchronization Extensions (TSX). In this paper, we present a detailed evaluation of RTM performance and energy expenditure. We compare RTM behavior to that of the TinySTM software transactional memory system, first by running microbenchmarks, and then by running the STAMP benchmark suite. We find that which system performs better depends heavily on workload characteristics. We then conduct a case study of two STAMP applications to assess the impact of programming style on RTM performance and to investigate what kinds of software optimizations can help overcome RTM's hardware limitations.
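For readers unfamiliar with RTM, the usage pattern being measured looks like the following: attempt a hardware transaction with the TSX intrinsics and retry on a conventional lock when it aborts. This is the standard lock-elision idiom (compile with g++ -mrtm), shown here on a toy shared counter rather than any STAMP workload.

```cpp
// Minimal RTM pattern on Haswell: transaction first, lock fallback second.
#include <immintrin.h>
#include <cstdio>

static volatile int lock = 0;  // fallback spinlock state
static long counter = 0;

void increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Reading the lock puts it in our read set: if a fallback thread
        // takes the lock later, this transaction aborts automatically.
        if (lock) _xabort(0xff);
        ++counter;             // transactional update
        _xend();
        return;
    }
    // Fallback: acquire the lock and update non-transactionally.
    while (__sync_lock_test_and_set(&lock, 1)) while (lock) ;
    ++counter;
    __sync_lock_release(&lock);
}

int main() {
    for (int i = 0; i < 1000; ++i) increment();
    std::printf("%ld\n", counter);
}
```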
Citations: 46
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.128
Qiang Guan, Nathan Debardeleben, S. Blanchard, Song Fu
As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease in both feature size and operating voltage, we expect a significant increase in hardware soft errors. Today's HPC applications are affected by soft errors only to a small degree, but we expect this to become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open-source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, as well as when and how to inject soft errors at different granularities, without interfering with other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers, or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation, and we demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
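The injected fault model is, at its core, a single bit flip in a value produced by an instruction. The sketch below reproduces that effect at user level on a double; F-SEFI itself performs the flip inside QEMU's instruction emulation, which this stand-in does not attempt.

```cpp
// Stand-in for the fault model: flip one bit of a 64-bit floating-point
// value and observe how the corruption changes the result.
#include <cstdint>
#include <cstdio>
#include <cstring>

double flip_bit(double v, int bit) {
    std::uint64_t raw;
    std::memcpy(&raw, &v, sizeof raw);   // type-pun safely
    raw ^= (std::uint64_t{1} << bit);    // inject the soft error
    std::memcpy(&v, &raw, sizeof v);
    return v;
}

int main() {
    double x = 3.141592653589793;
    for (int bit : {0, 32, 52, 62})      // low mantissa ... high exponent
        std::printf("bit %2d: %.15g\n", bit, flip_bit(x, bit));
}
```

Flips in low mantissa bits barely perturb the value, while exponent-bit flips change it by orders of magnitude, which is why vulnerability profiles depend on where in the datapath the error lands.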
Citations: 66
Interactive Program Debugging and Optimization for Directive-Based, Efficient GPU Computing
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.57
Seyong Lee, Dong Li, J. Vetter
Directive-based GPU programming models are gaining momentum, since they transparently relieve programmers from dealing with the complexity of low-level GPU programming, which often reflects the underlying architecture. However, too much abstraction in directive models puts a significant burden on programmers for debugging applications and tuning performance. In this paper, we propose a directive-based, interactive program debugging and optimization system. This system enables intuitive and synergistic interaction among programmers, compilers, and runtimes for more productive and efficient GPU computing. We have designed and implemented a series of prototype tools within our new open-source compiler framework, the Open Accelerator Research Compiler (OpenARC), which supports the full feature set of OpenACC V1.0. Our evaluation on twelve OpenACC benchmarks demonstrates that our prototype debugging and optimization system can detect a variety of translation errors. Additionally, the optimization provided by our prototype minimizes memory transfers when compared to a fully manual memory management scheme.
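For context, the input a compiler like OpenARC consumes is ordinary code annotated with OpenACC directives, where data clauses are a common source of the translation and memory-transfer errors such a debugging system must expose. Below is a generic OpenACC v1.0-style example, not drawn from the paper's benchmarks.

```cpp
// Generic OpenACC example: a vector add offloaded via directives. Without
// an OpenACC compiler the pragma is ignored and the loop runs on the host.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    float* pa = a.data(); float* pb = b.data(); float* pc = c.data();

    // copyin: host-to-device only; copyout: device-to-host only. Getting
    // these clauses wrong yields stale-data bugs that are hard to spot,
    // which is exactly what interactive debugging support targets.
    #pragma acc parallel loop copyin(pa[0:n], pb[0:n]) copyout(pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", c[0]);
}
```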
Citations: 10
Scibox: Online Sharing of Scientific Data via the Cloud
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.26
Jian Huang, Xuechen Zhang, G. Eisenhauer, K. Schwan, M. Wolf, S. Ethier, S. Klasky
Collaborative science demands global sharing of scientific data, but it cannot leverage universally accessible cloud-based infrastructures like Dropbox, as those offer limited interfaces and inadequate access bandwidth. We present the Scibox cloud facility for online sharing of scientific data. It uses standard cloud storage solutions, but offers a usage model in which high-end codes can write/read data to/from the cloud via the APIs they already use for their I/O actions. With Scibox, data upload/download volumes are controlled via data-reduction (DR) functions stated by end users and applied at the data source, before data is moved, with further gains in efficiency obtained by combining DR-functions to move exactly what is needed by current data consumers. We evaluate Scibox with science applications and their representative data analytics, GTS fusion and combustion image processing, demonstrating the potential for ubiquitous data access with substantial reductions in network traffic.
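A DR-function can be as simple as projecting away unneeded fields and subsampling before upload. The sketch below assumes a made-up record layout and reduction; Scibox's actual DR-function interface is not described in the abstract.

```cpp
// Sketch of a user-stated data-reduction (DR) function applied at the data
// source: only the fields current consumers need ever leave the node.
#include <cstdio>
#include <vector>

struct Record { double t, value, aux; };  // full simulation record (assumed)
struct Reduced { double t, value; };      // what consumers actually need

// DR-function: drop the unneeded field and keep every k-th record.
std::vector<Reduced> dr_subsample(const std::vector<Record>& in, int k) {
    std::vector<Reduced> out;
    for (size_t i = 0; i < in.size(); i += k)
        out.push_back({in[i].t, in[i].value});
    return out;
}

int main() {
    std::vector<Record> data(1000, Record{0.0, 1.0, 2.0});
    auto upload = dr_subsample(data, 10);  // 10x fewer, smaller records
    std::printf("upload %zu of %zu records\n", upload.size(), data.size());
}
```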
Citations: 7
DataMPI: Extending MPI to Hadoop-Like Big Data Computing
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.90
Xiaoyi Lu, Fan Liang, Bing Wang, L. Zha, Zhiwei Xu
MPI has been widely used in High Performance Computing. In contrast, such efficient communication support is lacking in the field of Big Data Computing, where communication is realized by time-consuming techniques such as HTTP/RPC. This paper takes a step toward bridging these two fields by extending MPI to support Hadoop-like Big Data Computing jobs, in which large numbers of key-value pair instances must be processed and communicated through distributed computation models such as MapReduce, Iteration, and Streaming. We abstract the characteristics of key-value communication patterns into a bipartite communication model, which reveals four distinctions from MPI: Dichotomic, Dynamic, Data-centric, and Diversified features. Utilizing this model, we propose the specification of a minimalistic extension to MPI. An open-source communication library, DataMPI, is developed to implement this specification. Performance experiments show that DataMPI has significant advantages in performance and flexibility, while maintaining the high productivity, scalability, and fault tolerance of Hadoop.
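The bipartite model can be pictured as an output side emitting key-value pairs that a partitioner routes to an accepting side, which groups them. The in-process sketch below illustrates only that flow; all names are hypothetical, not DataMPI's actual API, and the real library layers this communication on MPI.

```cpp
// In-process stand-in for the bipartite key-value model: an O (output)
// side emits pairs, a hash partitioner picks the A (accept) task, and the
// A side groups values by key, MapReduce-style.
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

int main() {
    const int a_tasks = 3;  // size of the accepting side (assumed)
    std::vector<std::vector<KV>> inbox(a_tasks);

    // O side: emit pairs; hash partitioning is the dichotomic,
    // data-centric communication step.
    std::vector<KV> emitted = {{"apple", 1}, {"pear", 1}, {"apple", 1}};
    for (const auto& kv : emitted)
        inbox[std::hash<std::string>{}(kv.first) % a_tasks].push_back(kv);

    // A side: group and combine values by key.
    for (int t = 0; t < a_tasks; ++t) {
        std::map<std::string, int> grouped;
        for (const auto& kv : inbox[t]) grouped[kv.first] += kv.second;
        for (const auto& [k, v] : grouped)
            std::printf("task %d: %s -> %d\n", t, k.c_str(), v);
    }
}
```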
Citations: 64
TBPoint: Reducing Simulation Time for Large-Scale GPGPU Kernels
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.53
Jen-Cheng Huang, Lifeng Nai, Hyesoon Kim, H. Lee
Architecture simulation of GPGPU kernels can take a significant amount of time, especially for large-scale kernels. This paper presents TBPoint, a profiling-based sampling infrastructure for GPGPU kernels that reduces cycle-level simulation time. Compared to existing approaches, TBPoint provides a flexible and architecture-independent way to take samples. For the twelve evaluated kernels, the geometric mean sampling errors of TBPoint, Ideal-Simpoint, and random sampling are 0.47%, 1.74%, and 7.95%, respectively, while the geometric means of their total sample sizes are 2.6%, 5.4%, and 10%, respectively. TBPoint narrows the speed gap between hardware and GPGPU simulators, enabling more and more large-scale GPGPU kernels to be analyzed using detailed timing simulations.
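Sampled simulation of this kind typically extrapolates whole-kernel cycles by weighting each sampled thread-block cluster by its share of all thread blocks. The sketch below shows that arithmetic with made-up weights and cycle counts; the clusters themselves would come from a profiling phase like TBPoint's.

```cpp
// Extrapolating kernel cycles from sampled thread-block clusters: each
// cluster's detailed-simulation cycle count is weighted by the fraction of
// all thread blocks it represents.
#include <cstdio>
#include <vector>

struct Cluster { double weight; double sampled_cycles; };

double estimate_total_cycles(const std::vector<Cluster>& clusters,
                             long total_blocks) {
    double per_block = 0.0;
    for (const auto& c : clusters)
        per_block += c.weight * c.sampled_cycles;  // weighted mean per block
    return per_block * static_cast<double>(total_blocks);
}

int main() {
    std::vector<Cluster> clusters = {{0.7, 1200.0}, {0.2, 3400.0}, {0.1, 900.0}};
    std::printf("estimated cycles: %.0f\n",
                estimate_total_cycles(clusters, 65536));
}
```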
Citations: 17
BFS and Coloring-Based Parallel Algorithms for Strongly Connected Components and Related Problems
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.64
George M. Slota, S. Rajamanickam, Kamesh Madduri
Finding the strongly connected components (SCCs) of a directed graph is a fundamental graph-theoretic problem. Tarjan's algorithm is an efficient serial algorithm to find SCCs, but relies on the hard-to-parallelize depth-first search (DFS). We observe that implementations of several parallel SCC detection algorithms show poor parallel performance on modern multicore platforms and large-scale networks. This paper introduces the Multistep method, a new approach that avoids work inefficiencies seen in prior SCC approaches. It does not rely on DFS, but instead uses a combination of breadth-first search (BFS) and a parallel graph coloring routine. We show that the Multistep method scales well on several real-world graphs, with performance fairly independent of topological properties such as the size of the largest SCC and the total number of SCCs. On a 16-core Intel Xeon platform, our algorithm achieves a 20X speedup over the serial approach on a 2 billion edge graph, fully decomposing it in under two seconds. For our collection of test networks, we observe that the Multistep method is 1.92X faster (mean speedup) than the state-of-the-art Hong et al. SCC method. In addition, we modify the Multistep method to find connected and weakly connected components, as well as introduce a novel algorithm for determining articulation vertices of biconnected components. These approaches all utilize the same underlying BFS and coloring routines.
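The DFS-free core idea is visible in the forward-backward step: the SCC containing a chosen pivot is exactly the intersection of the pivot's forward-BFS and backward-BFS reachable sets. Here is a sketch on a toy graph, omitting the trimming and coloring phases.

```cpp
// BFS-based SCC extraction for one pivot: SCC(pivot) = forward-reachable
// set intersected with backward-reachable set. No DFS required.
#include <cstdio>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<int>>;

std::vector<bool> bfs(const Graph& g, int src) {
    std::vector<bool> seen(g.size(), false);
    std::queue<int> q;
    seen[src] = true; q.push(src);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int v : g[u])
            if (!seen[v]) { seen[v] = true; q.push(v); }
    }
    return seen;
}

int main() {
    // 0->1->2->0 forms an SCC; 2->3 dangles off it.
    Graph fwd = {{1}, {2}, {0, 3}, {}};
    Graph bwd(fwd.size());  // reverse graph for the backward search
    for (size_t u = 0; u < fwd.size(); ++u)
        for (int v : fwd[u]) bwd[v].push_back(static_cast<int>(u));

    auto f = bfs(fwd, 0), b = bfs(bwd, 0);
    std::printf("SCC of pivot 0:");
    for (size_t v = 0; v < fwd.size(); ++v)
        if (f[v] && b[v]) std::printf(" %zu", v);
    std::printf("\n");
}
```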
Citations: 85