Jieming Yin, Pingqiang Zhou, S. Sapatnekar, Antonia Zhai
NoCs are an integral part of modern multicore processors; they must continuously support high-throughput, low-latency on-chip data communication under a stringent energy budget as system size scales up. Heterogeneous multicore systems further push the limits of NoC design by integrating cores with diverse performance requirements onto the same die. Traditional packet-switched NoCs, which have the flexibility to connect diverse computation and storage devices, face great challenges in meeting the performance requirements within the energy budget because of the latency and energy consumption associated with buffering and routing at each router. In this paper, we take advantage of the diversity in performance requirements of on-chip heterogeneous computing devices by designing, implementing, and evaluating a hybrid-switched network that allows packet-switched and circuit-switched messages to share the same communication fabric by partitioning the network through time-division multiplexing (TDM). In the proposed hybrid-switched network, circuit-switched paths are established between frequently communicating nodes. Our experiments show that utilizing these paths can improve system performance by reducing communication latency and alleviating network congestion. Furthermore, better energy efficiency is achieved by reducing buffering in routers, which in turn enables aggressive power gating.
{"title":"Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems","authors":"Jieming Yin, Pingqiang Zhou, S. Sapatnekar, Antonia Zhai","doi":"10.1109/IPDPS.2014.40","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.40","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
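The abstract above describes sharing one fabric between circuit-switched and packet-switched traffic through TDM. The sketch below illustrates one common way such sharing can be modeled, and is not the authors' design: each router holds a slot table, a circuit reserves consecutive slots along its path, and any unreserved slot is left to packet switching. The names `Router`, `setup_circuit`, and the port labels are hypothetical, and the one-hop-per-cycle timing is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical router model with a TDM slot table of fixed period."""
    num_slots: int
    # slot index -> (input port, output port) reserved for a circuit
    reservations: dict = field(default_factory=dict)

    def is_free(self, slot: int) -> bool:
        return slot % self.num_slots not in self.reservations

    def reserve(self, slot: int, in_port: str, out_port: str) -> None:
        slot %= self.num_slots
        if slot in self.reservations:
            raise ValueError(f"slot {slot} is already reserved")
        self.reservations[slot] = (in_port, out_port)

    def mode(self, cycle: int) -> str:
        """Return which switching mode owns the router in this cycle."""
        return "circuit" if cycle % self.num_slots in self.reservations else "packet"


def setup_circuit(routers: dict, path: list, start_slot: int) -> None:
    """Reserve slot (start_slot + hop) at each router along the path.

    Assumes a flit advances exactly one hop per cycle, so it meets its
    reserved slot at every router and never needs to be buffered.
    `path` is a list of (router_id, in_port, out_port) tuples.
    """
    # Check every hop first so a conflict leaves no partial reservation.
    for hop, (rid, _, _) in enumerate(path):
        if not routers[rid].is_free(start_slot + hop):
            raise ValueError(f"router {rid} has no free slot for hop {hop}")
    for hop, (rid, in_port, out_port) in enumerate(path):
        routers[rid].reserve(start_slot + hop, in_port, out_port)
```

With three routers of period 4 and a circuit starting in slot 3, the reserved slots are 3, 0, and 1 at successive hops; every other cycle at those routers remains available to packet-switched messages.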
Hartree-Fock (HF) or self-consistent field (SCF) calculations are widely used in quantum chemistry and are the starting point for accurate electronic correlation methods. Existing algorithms and software, however, may fail to scale to large numbers of cores of a distributed machine, particularly in the simulation of moderately sized molecules. In existing codes, HF calculations are divided into tasks. Fine-grained tasks are better for load balance, but coarse-grained tasks require less communication. In this paper, we present a new parallelization of HF calculations that addresses this trade-off: we use fine-grained tasks to balance the computation among large numbers of cores, but we also use a scheme for assigning tasks to processes that reduces communication. We specifically focus on the distributed construction of the Fock matrix arising in the HF algorithm, and describe the data access patterns in detail. For our test molecules, our implementation shows better scalability than NWChem for constructing the Fock matrix.
{"title":"A New Scalable Parallel Algorithm for Fock Matrix Construction","authors":"Xing Liu, Aftab Patel, Edmond Chow","doi":"10.1109/IPDPS.2014.97","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.97","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Boyu Zhang, Trilce Estrada, Pietro Cicotti, M. Taufer
This paper presents a one-pass, distributed method that enables in-situ data analysis for large protein folding trajectory datasets by executing sufficiently fast, avoiding moving trajectory data, and limiting memory usage. First, the method extracts the geometric shape features of each protein conformation in parallel. Then, it classifies sets of consecutive conformations into meta-stable and transition stages using a probabilistic hierarchical clustering method. Lastly, it rebuilds the global knowledge necessary for the intra- and inter-trajectory analysis through a reduction operation. The comparison of our method with a traditional approach for a villin headpiece subdomain shows that our method generates significant improvements in execution time, memory usage, and data movement. Specifically, to analyze the same trajectory consisting of 20,000 protein conformations, our method runs in 41.5 seconds while the traditional approach takes approximately 3 hours; uses 6.9 MB of memory per core while the traditional method uses 16 GB on the single node where the analysis is performed; and communicates only 4.4 KB while the traditional method moves the entire 539 MB dataset. The overall results in this paper support our claim that our method is suitable for in-situ data analysis of folding trajectories.
{"title":"Enabling In-Situ Data Analysis for Large Protein-Folding Trajectory Datasets","authors":"Boyu Zhang, Trilce Estrada, Pietro Cicotti, M. Taufer","doi":"10.1109/IPDPS.2014.33","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.33","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Bhavishya Goel, J. Gil, A. Negi, S. Mckee, P. Stenström
Hardware transactional memory implementations are becoming increasingly available. For instance, the Intel Core i7 4770 implements Restricted Transactional Memory (RTM) support for Intel Transactional Synchronization Extensions (TSX). In this paper, we present a detailed evaluation of RTM performance and energy expenditure. We compare RTM behavior to that of the TinySTM software transactional memory system, first by running microbenchmarks and then by running the STAMP benchmark suite. We find that which system performs better depends heavily on the workload characteristics. We then conduct a case study of two STAMP applications to assess the impact of programming style on RTM performance and to investigate what kinds of software optimizations can help overcome RTM's hardware limitations.
{"title":"Performance and Energy Analysis of the Restricted Transactional Memory Implementation on Haswell","authors":"Bhavishya Goel, J. Gil, A. Negi, S. Mckee, P. Stenström","doi":"10.1109/IPDPS.2014.70","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.70","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Qiang Guan, Nathan Debardeleben, S. Blanchard, Song Fu
As the high performance computing (HPC) community continues to push towards exascale computing, resilience remains a serious challenge. With the expected decrease of both feature size and operating voltage, we expect a significant increase in hardware soft errors. Today's HPC applications are affected by soft errors only to a small degree, but we expect that this will become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. In this paper we utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, and when and how to inject soft errors at different granularities, without interfering with other applications that share the same environment. F-SEFI does this without requiring revisions to the application source code, compilers, or operating systems. We discuss the design constraints for F-SEFI and the specifics of our implementation. We demonstrate use cases of F-SEFI on several benchmark applications to show how data corruption can propagate to incorrect results.
{"title":"F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability","authors":"Qiang Guan, Nathan Debardeleben, S. Blanchard, Song Fu","doi":"10.1109/IPDPS.2014.128","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.128","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
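To make the idea of a soft error propagating to an incorrect result concrete, here is a minimal sketch of a single-bit flip in a floating-point operand. It only illustrates the concept at the application-data level; it does not use QEMU or any F-SEFI interface, and the helper names `flip_bit`, `inject`, and `dot` are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Bit 63 is the sign, bits 52-62 the exponent, bits 0-51 the fraction.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def inject(values: list, index: int, bit: int) -> list:
    """Return a copy of `values` with a single bit flipped in one element."""
    corrupted = list(values)
    corrupted[index] = flip_bit(corrupted[index], bit)
    return corrupted


def dot(xs: list, ys: list) -> float:
    """Plain dot product, standing in for an application kernel."""
    return sum(x * y for x, y in zip(xs, ys))
```

For example, `dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])` is 32.0, but flipping bit 62 of the second operand turns 2.0 into 0.0, so `dot(inject([1.0, 2.0, 3.0], 1, 62), [4.0, 5.0, 6.0])` yields 22.0. A flip in a low fraction bit, by contrast, changes the result only slightly, which is why the position of the injected fault matters when profiling vulnerability.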
Directive-based GPU programming models are gaining momentum, since they transparently relieve programmers from dealing with the complexity of low-level GPU programming, which often reflects the underlying architecture. However, too much abstraction in directive models puts a significant burden on programmers when debugging applications and tuning performance. In this paper, we propose a directive-based, interactive program debugging and optimization system. This system enables intuitive and synergistic interaction among programmers, compilers, and runtimes for more productive and efficient GPU computing. We have designed and implemented a series of prototype tools within our new open source compiler framework, called Open Accelerator Research Compiler (OpenARC). OpenARC supports the full feature set of OpenACC V1.0. Our evaluation on twelve OpenACC benchmarks demonstrates that our prototype debugging and optimization system can detect a variety of translation errors. Additionally, the optimization provided by our prototype minimizes memory transfers when compared to a fully manual memory management scheme.
{"title":"Interactive Program Debugging and Optimization for Directive-Based, Efficient GPU Computing","authors":"Seyong Lee, Dong Li, J. Vetter","doi":"10.1109/IPDPS.2014.57","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.57","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Jian Huang, Xuechen Zhang, G. Eisenhauer, K. Schwan, M. Wolf, S. Ethier, S. Klasky
Collaborative science demands global sharing of scientific data, but it cannot leverage universally accessible cloud-based infrastructures like Dropbox, as those offer limited interfaces and inadequate levels of access bandwidth. We present the Scibox cloud facility for online sharing of scientific data. It uses standard cloud storage solutions, but offers a usage model in which high-end codes can write/read data to/from the cloud via the APIs they already use for their I/O actions. With Scibox, data upload/download volumes are controlled via data reduction (DR) functions stated by end users and applied at the data source, before data is moved, with further gains in efficiency obtained by combining DR-functions to move exactly what is needed by current data consumers. We evaluate Scibox with science applications and their representative data analytics (GTS fusion and combustion image processing), demonstrating the potential for ubiquitous data access with substantial reductions in network traffic.
{"title":"Scibox: Online Sharing of Scientific Data via the Cloud","authors":"Jian Huang, Xuechen Zhang, G. Eisenhauer, K. Schwan, M. Wolf, S. Ethier, S. Klasky","doi":"10.1109/IPDPS.2014.26","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.26","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Xiaoyi Lu, Fan Liang, Bing Wang, L. Zha, Zhiwei Xu
MPI has been widely used in High Performance Computing. In contrast, such efficient communication support is lacking in the field of Big Data Computing, where communication is realized by time-consuming techniques such as HTTP/RPC. This paper takes a step toward bridging these two fields by extending MPI to support Hadoop-like Big Data Computing jobs, which need to process and communicate large numbers of key-value pair instances through distributed computation models such as MapReduce, Iteration, and Streaming. We abstract the characteristics of key-value communication patterns into a bipartite communication model, which reveals four distinctions from MPI: Dichotomic, Dynamic, Data-centric, and Diversified features. Utilizing this model, we propose the specification of a minimalistic extension to MPI. An open source communication library, DataMPI, is developed to implement this specification. Performance experiments show that DataMPI has significant advantages in performance and flexibility, while maintaining the high productivity, scalability, and fault tolerance of Hadoop.
{"title":"DataMPI: Extending MPI to Hadoop-Like Big Data Computing","authors":"Xiaoyi Lu, Fan Liang, Bing Wang, L. Zha, Zhiwei Xu","doi":"10.1109/IPDPS.2014.90","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.90","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Architecture simulation for GPGPU kernels can take a significant amount of time, especially for large-scale GPGPU kernels. This paper presents TBPoint, a profiling-based sampling infrastructure for GPGPU kernels that reduces cycle-level simulation time. Compared to existing approaches, TBPoint provides a flexible and architecture-independent way to take samples. For the 12 evaluated kernels, the geometric means of the sampling errors of TBPoint, Ideal-Simpoint, and random sampling are 0.47%, 1.74%, and 7.95%, respectively, while the geometric means of the total sample sizes of TBPoint, Ideal-Simpoint, and random sampling are 2.6%, 5.4%, and 10%, respectively. TBPoint narrows the speed gap between hardware and GPGPU simulators, enabling more large-scale GPGPU kernels to be analyzed using detailed timing simulations.
{"title":"TBPoint: Reducing Simulation Time for Large-Scale GPGPU Kernels","authors":"Jen-Cheng Huang, Lifeng Nai, Hyesoon Kim, H. Lee","doi":"10.1109/IPDPS.2014.53","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.53","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
Finding the strongly connected components (SCCs) of a directed graph is a fundamental graph-theoretic problem. Tarjan's algorithm is an efficient serial algorithm to find SCCs, but it relies on the hard-to-parallelize depth-first search (DFS). We observe that implementations of several parallel SCC detection algorithms show poor parallel performance on modern multicore platforms and large-scale networks. This paper introduces the Multistep method, a new approach that avoids work inefficiencies seen in prior SCC approaches. It does not rely on DFS, but instead uses a combination of breadth-first search (BFS) and a parallel graph coloring routine. We show that the Multistep method scales well on several real-world graphs, with performance fairly independent of topological properties such as the size of the largest SCC and the total number of SCCs. On a 16-core Intel Xeon platform, our algorithm achieves a 20X speedup over the serial approach on a 2-billion-edge graph, fully decomposing it in under two seconds. For our collection of test networks, we observe that the Multistep method is 1.92X faster (mean speedup) than the state-of-the-art Hong et al. SCC method. In addition, we modify the Multistep method to find connected and weakly connected components, and we introduce a novel algorithm for determining articulation vertices of biconnected components. These approaches all utilize the same underlying BFS and coloring routines.
{"title":"BFS and Coloring-Based Parallel Algorithms for Strongly Connected Components and Related Problems","authors":"George M. Slota, S. Rajamanickam, Kamesh Madduri","doi":"10.1109/IPDPS.2014.64","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.64","journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium"},"publicationDate":"2014-05-19"}
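The abstract above names two DFS-free building blocks for SCC detection: BFS and graph coloring. The following serial sketch shows how the two can be combined, as an illustration rather than the authors' parallel implementation. The function names are hypothetical, and the ordering of phases (one forward/backward BFS from a high-degree pivot, then repeated color propagation) is an assumption about how the pieces fit together.

```python
from collections import deque


def bfs(start, adj, allowed):
    """Return all vertices in `allowed` reachable from `start` via `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def scc_bfs_coloring(n, edges):
    """Find the SCCs of a directed graph on vertices 0..n-1 without DFS."""
    out_adj = [[] for _ in range(n)]
    in_adj = [[] for _ in range(n)]
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)

    remaining = set(range(n))
    components = []

    # Phase 1: the SCC of a pivot is the intersection of the sets reached
    # by a forward BFS and a backward BFS from it. A high-degree pivot is
    # a heuristic for landing in a large SCC.
    if remaining:
        pivot = max(remaining, key=lambda v: len(out_adj[v]) * len(in_adj[v]))
        comp = bfs(pivot, out_adj, remaining) & bfs(pivot, in_adj, remaining)
        components.append(comp)
        remaining -= comp

    # Phase 2: coloring. Propagate the largest vertex id forward along
    # edges until nothing changes; a vertex that keeps its own id is a
    # root, and its SCC is what a backward BFS reaches within its color.
    while remaining:
        color = {v: v for v in remaining}
        changed = True
        while changed:
            changed = False
            for u in remaining:
                for v in out_adj[u]:
                    if v in remaining and color[v] < color[u]:
                        color[v] = color[u]
                        changed = True
        groups = {}
        for v in remaining:
            groups.setdefault(color[v], set()).add(v)
        for root, group in groups.items():
            comp = bfs(root, in_adj, group)
            components.append(comp)
            remaining -= comp

    return components
```

For a graph with edges `[(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]` on six vertices, the function returns the components `{0, 1, 2}`, `{3, 4}`, and `{5}` (in some order). In a parallel setting, both the BFS frontier expansion and the color propagation loop can process vertices independently, which is what makes these routines attractive compared with DFS.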