
2014 IEEE 28th International Parallel and Distributed Processing Symposium: Latest Publications

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.47
Weifeng Liu, B. Vinter
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as the algebraic multigrid method, breadth-first search, and the shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM algorithm has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the result sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the result sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. Recent work on GPU SpGEMM has demonstrated rather good time and space complexity, but works best for fairly regular matrices. In this work we present a GPU SpGEMM algorithm that particularly focuses on the above three problems. Memory pre-allocation for the result matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm, which is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art GPU SpGEMM methods in the CUSPARSE library and the CUSP library and the latest CPU SpGEMM method in the Intel Math Kernel Library, our approach delivers excellent absolute performance and relative speedups on a benchmark suite composed of 23 matrices with diverse sparsity structures.
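Irregularity (1), the unknown number of nonzeros in the result, is easy to see in a serial reference implementation: each output row must be fully accumulated before its size is known. The plain-Python CSR sketch below (our illustration, not the paper's GPU algorithm) uses a per-row hash-map accumulator for exactly this reason.

```python
# Illustrative serial SpGEMM, C = A * B, with all matrices in CSR form.
# nnz per row of C is only known after the row's accumulator is filled,
# which mirrors the memory pre-allocation problem the paper addresses.

def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    c_ptr, c_idx, c_val = [0], [], []
    n_rows = len(a_ptr) - 1
    for i in range(n_rows):
        acc = {}  # column -> partial sum for row i of C
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j, v = a_idx[k], a_val[k]
            for l in range(b_ptr[j], b_ptr[j + 1]):
                acc[b_idx[l]] = acc.get(b_idx[l], 0.0) + v * b_val[l]
        for col in sorted(acc):  # emit row i in column order
            c_idx.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

A GPU version cannot afford a dynamic hash map per row, which is one reason the paper turns to hybrid memory pre-allocation and merge-path-based insertion instead.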
Citations: 94
PAGE: A Framework for Easy PArallelization of GEnomic Applications
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.19
Mucahid Kutlu, G. Agrawal
With the availability of high-throughput and low-cost sequencing technologies, an increasing amount of genetic data is becoming available to researchers. There is clearly a potential for significant new scientific and medical advances by analysis of such data; however, it is imperative to exploit parallelism and achieve effective utilization of the computing resources to be able to handle massive datasets. Thus, frameworks that can help researchers develop parallel applications without dealing with low-level details of parallel coding are very important for advances in genetic research. In this study, we develop a middleware, PAGE, which supports 'map-reduce-like' processing, but with significant differences from a system like Hadoop, to be useful and effective for parallelizing analysis of genomic data. In particular, it can work with map functions written in any language, thus allowing utilization of existing serial tools (even those for which only an executable is available) as map functions. Thus, it can greatly simplify parallel application development for scenarios where complex data formats and/or nuanced serial algorithms are involved, as is often the case for genomic data. It allows parallelization by partitioning by-locus or by-chromosome, and provides different scheduling schemes and execution models to match the nature of algorithms common in genetic research. We have evaluated the middleware system using four popular genomic applications, including VarScan, Unified Genotyper, Realigner Target Creator, and Indel Realigner, and compared the achieved performance against two popular frameworks (Hadoop and GATK). We show that our middleware outperforms GATK and Hadoop and is able to achieve high parallel efficiency and scalability.
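As a rough sketch of the partition-by-chromosome execution model described above (function names and data layout are our invention, not PAGE's actual API), the per-partition map calls below are independent and could be dispatched to parallel workers; the map function is a black box, standing in for a wrapped serial tool.

```python
# Hypothetical partition-by-chromosome map step (not PAGE's real interface).
from collections import defaultdict

def partition_by_chromosome(records):
    """Group (chrom, pos, payload) records into per-chromosome partitions."""
    parts = defaultdict(list)
    for rec in records:
        parts[rec[0]].append(rec)
    return parts

def run_map(parts, map_fn):
    """Apply a black-box map function independently to each partition.
    In a real framework these calls would run in parallel."""
    return {chrom: map_fn(recs) for chrom, recs in parts.items()}
```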
Citations: 4
A Spatio-temporal Coupling Method to Reduce the Time-to-Solution of Cardiovascular Simulations
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.68
A. Randles, E. Kaxiras
We present a new parallel-in-time method designed to reduce the overall time-to-solution of a patient-specific cardiovascular flow simulation. Using a modified Parareal algorithm, our approach extends strong scalability beyond spatial parallelism with fully controllable accuracy and no decrease in stability. We discuss the coupling of the spatial and temporal domain decompositions used in our implementation, and showcase the use of the method on a study of blood flow through the aorta. We observe an additional 40% reduction in overall wall-clock time with no significant loss of accuracy, in agreement with a predictive performance model.
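For readers unfamiliar with Parareal, the toy sketch below shows its core iteration for dy/dt = -y: a cheap coarse propagator sweeps sequentially, fine propagators run independently on each time slice (in parallel, in a real deployment), and a correction sweep combines the two. This is our illustration of standard Parareal, not the paper's modified algorithm or its coupling with spatial decomposition.

```python
# Toy Parareal for dy/dt = -y on [0, T], forward-Euler propagators only.

def coarse(y, dt):
    return y * (1.0 - dt)          # one cheap Euler step

def fine(y, dt, substeps=10):
    h = dt / substeps              # several smaller Euler steps
    for _ in range(substeps):
        y = y * (1.0 - h)
    return y

def parareal(y0, T, n_slices, iters):
    dt = T / n_slices
    U = [y0] * (n_slices + 1)
    for n in range(n_slices):      # initial sequential coarse sweep
        U[n + 1] = coarse(U[n], dt)
    for _ in range(iters):
        F = [fine(U[n], dt) for n in range(n_slices)]      # parallel in time
        G_old = [coarse(U[n], dt) for n in range(n_slices)]
        for n in range(n_slices):  # sequential correction sweep
            U[n + 1] = coarse(U[n], dt) + F[n] - G_old[n]
    return U[-1]
```

After at most `n_slices` iterations the corrected solution matches a purely sequential run of the fine propagator, which is the sense in which Parareal trades extra arithmetic for wall-clock parallelism in time.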
Citations: 5
F2C2-STM: Flux-Based Feedback-Driven Concurrency Control for STMs
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.99
K. Ravichandran, S. Pande
Software Transactional Memory (STM) systems provide an easy-to-use programming model for concurrent code and have been found suitable for parallelizing many applications, providing performance gains with minimal programmer effort. With increasing core counts on modern processors one would expect increasing benefits. However, we observe that running STM applications on higher core counts is sometimes, in fact, detrimental to performance. This is due to the larger number of conflicts that arise with a larger number of parallel cores. As the number of cores available on processors steadily rises, a larger number of applications are beginning to exhibit these characteristics. In this paper we propose a novel dynamic concurrency control technique which can significantly improve performance (up to 50%) as well as resource utilization (up to 85%) for these applications at higher core counts. Our technique uses ideas borrowed from TCP's network congestion control algorithm and uses self-induced concurrency fluctuations to dynamically monitor and match varying concurrency levels in applications while minimizing global synchronization. Our flux-based feedback-driven concurrency control technique is capable of fully recovering the performance of the best statically chosen concurrency specification (as chosen by an oracle), regardless of the initial specification, for several real-world applications. Further, our technique can actually improve upon the performance of the oracle-chosen specification by more than 10% for certain applications through dynamic adaptation to available parallelism. We demonstrate our approach on the STAMP benchmark suite while reporting significant performance and resource utilization benefits. We also demonstrate significantly better performance when comparing against state-of-the-art concurrency control and scheduling techniques. Further, our technique is programmer friendly as it requires no changes to application code and no offline phases.
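A minimal sketch of the TCP-inspired idea, in our own simplified form: treat transaction aborts like packet loss and adjust the number of threads allowed to run transactions with additive increase/multiplicative decrease. The thresholds and factors below are invented for illustration; F2C2-STM's actual flux-based controller is more sophisticated.

```python
# Hypothetical AIMD-style controller for an STM's active-thread budget.

def next_concurrency(current, commits, aborts, max_threads,
                     abort_threshold=0.3, decrease_factor=0.5):
    """Pick the thread budget for the next window from this window's stats."""
    total = commits + aborts
    abort_rate = aborts / total if total else 0.0
    if abort_rate > abort_threshold:
        # heavy conflicts: back off multiplicatively, like TCP on loss
        return max(1, int(current * decrease_factor))
    # light conflicts: probe upward additively
    return min(max_threads, current + 1)
```

Run repeatedly, such a controller oscillates around the concurrency level the workload can sustain, which is the self-induced fluctuation behavior the abstract describes.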
Citations: 17
Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.80
Chuanfu Xu, Lilun Zhang, Xiaogang Deng, Jianbin Fang, Guang-Xiong Wang, Wei Cao, Yonggang Che, Yongxian Wang, Wei Liu
HOSTA is an in-house high-order CFD software package that can simulate complex flows with complex geometries. Large-scale high-order CFD simulations using HOSTA require massive HPC resources, motivating us to port it onto modern GPU-accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we make the CPU and GPU collaborate for HOSTA instead of using a naive GPU-only approach. We present multiple novel techniques to balance the load between the store-poor GPU and the store-rich CPU, and to overlap the collaborative computation and communication as far as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3X, while the collaborative approach improves performance by around 45% compared to the GPU-only approach. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration C919 containing 150M grid cells. To the best of our knowledge, this is the first paper that reports a CPU-GPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.
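The simplest form of the CPU-GPU balancing question is splitting grid cells in proportion to measured device throughput so that both devices finish a step at roughly the same time. The sketch below is our simplification; HOSTA's balancing additionally handles the overlap of computation and communication.

```python
# Hypothetical throughput-proportional static split of grid cells
# between a GPU and a CPU (illustrative, not HOSTA's actual scheme).

def split_cells(total_cells, gpu_rate, cpu_rate):
    """Split cells so time ~ cells/rate is equal on both devices.

    gpu_rate/cpu_rate: measured cells processed per second per device.
    Returns (gpu_cells, cpu_cells)."""
    gpu_share = gpu_rate / (gpu_rate + cpu_rate)
    gpu_cells = int(total_cells * gpu_share)
    return gpu_cells, total_cells - gpu_cells
```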
Citations: 9
A Case for a Flexible Scalar Unit in SIMT Architecture
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.21
Yi Yang, Ping Xiang, Mike Mantor, Norman Rubin, Lisa R. Hsu, Qunfeng Dong, Huiyang Zhou
The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing. However, because of the SIMT-style processing, an instruction will be executed in every thread even if the operands are identical for all the threads. To overcome this inefficiency, AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. In GCN, both the SIMT unit and the scalar unit share a single SIMT-style instruction stream. Depending on its type, an instruction is issued to either a scalar or a SIMT unit. In this paper, we propose to extend the scalar unit so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream. The program to be executed by the scalar unit is referred to as a scalar program, and its purpose is to assist SIMT-unit execution. The scalar programs are either generated from SIMT programs automatically by the compiler or manually developed by expert developers. We make a case for our proposed flexible scalar unit through three collaborative execution paradigms: data prefetching, control divergence elimination, and scalar-workload extraction. Our experimental results show that significant performance gains can be achieved using our proposed approaches compared to state-of-the-art SIMT-style processing.
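The inefficiency the scalar unit targets can be phrased as a simple predicate: if an instruction's operands are identical across every lane of a wavefront, executing it per-lane is redundant and it is a candidate for the scalar unit. The detector below is a hypothetical software illustration, not AMD's hardware logic or the paper's compiler analysis.

```python
# Hypothetical uniformity check for one instruction across a wavefront.

def is_scalar_candidate(operands_per_lane):
    """operands_per_lane: one operand tuple per lane for a single
    instruction; uniform operands mean one scalar-unit issue suffices."""
    return len(set(operands_per_lane)) == 1
```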
Citations: 19
MobiStreams: A Reliable Distributed Stream Processing System for Mobile Devices
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.17
Huayong Wang, L. Peh
Multi-core phones are now pervasive. Yet, existing applications rely predominantly on a client-server computing paradigm, using phones only as thin clients, sending sensed information via the cellular network to servers for processing. This makes the cellular network the bottleneck, limiting overall application performance. In this paper, we propose MobiStreams, a Distributed Stream Processing System (DSPS) that runs directly on smartphones. MobiStreams can offload computing from remote servers to local phones and thus alleviate the pressure on the cellular network. Implementing a DSPS on smartphones faces significant challenges: 1) multiple phones can readily fail simultaneously, and 2) the phones' ad-hoc WiFi network has low bandwidth. MobiStreams tackles these challenges through two new techniques: 1) token-triggered checkpointing, and 2) broadcast-based checkpointing. Our evaluations, driven by two real-world applications deployed in the US and Singapore, show that migrating from a server platform to a smartphone platform eliminates the cellular network bottleneck, leading to a 0.78~42.6X throughput increase and a 10%~94.8% latency decrease. Also, MobiStreams' fault tolerance scheme increases throughput by 230% and reduces latency by 40% vs. prior state-of-the-art fault-tolerant DSPSs.
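One plausible reading of token-triggered checkpointing can be sketched as a token circulating among nodes, with a node checkpointing its operator state only while holding the token, so checkpoints are serialized rather than all hitting the weak ad-hoc WiFi network at once. The class below is our hypothetical illustration, not MobiStreams' actual protocol.

```python
# Hypothetical token-ring checkpoint trigger for a small set of stream nodes.

class TokenRing:
    def __init__(self, n_nodes):
        self.n = n_nodes
        self.holder = 0           # index of the node currently holding the token
        self.checkpoints = []     # (node, state) log of taken checkpoints

    def step(self, states):
        """One round: only the token holder checkpoints, then passes the token."""
        self.checkpoints.append((self.holder, states[self.holder]))
        self.holder = (self.holder + 1) % self.n
```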
Citations: 27
Identifying Code Phases Using Piece-Wise Linear Regressions
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.100
Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta
Node-level performance is one of the factors that may limit applications from reaching a supercomputer's peak performance. Studying node-level performance and attributing it to the source code results in valuable insight that can be used to improve application efficiency, although performing such a study may be an intimidating task due to the complexity and size of the applications. We present in this paper a mechanism that combines piece-wise linear regressions, coarse-grain sampling, and minimal instrumentation to detect performance phases in the computation regions, even if their granularity is very fine. This mechanism then maps the performance of each phase onto the application's syntactical structure, displaying a correlation between performance and source code. We introduce a methodology on top of this mechanism to describe the node-level performance of parallel applications, even for applications seen for the first time. Finally, we demonstrate the methodology by describing optimized in-production applications and further improving their performance through small transformations to the code based on the hints discovered.
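A minimal version of phase detection by piece-wise linear fitting can be sketched as a recursive split: fit one line to a window of samples and split at the worst-fitting point until every segment fits within a tolerance. This is our illustration of the general technique; the paper's mechanism additionally folds in sampling and instrumentation details.

```python
# Hypothetical phase detector: recursive piece-wise linear segmentation
# of a sampled performance series (e.g., a counter value over time).

def fit_line(xs, ys):
    """Least-squares line through (xs, ys); degenerates to a flat line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    den = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / den if den else 0.0
    return slope, my - slope * mx

def segment(xs, ys, tol):
    """Return (start, end) index ranges, one per detected phase."""
    slope, icpt = fit_line(xs, ys)
    errs = [abs(y - (slope * x + icpt)) for x, y in zip(xs, ys)]
    worst = max(range(len(errs)), key=errs.__getitem__)
    if errs[worst] <= tol or len(xs) < 4:
        return [(0, len(xs))]          # this window is one phase
    cut = max(1, min(len(xs) - 1, worst))   # keep both halves non-empty
    left = segment(xs[:cut], ys[:cut], tol)
    right = segment(xs[cut:], ys[cut:], tol)
    return left + [(s + cut, e + cut) for s, e in right]
```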
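To illustrate the core idea of phase detection via piece-wise linear regression, here is a minimal sketch (not the authors' tool): a performance-metric time series is greedily segmented, extending the current segment while a single least-squares line still fits within a residual tolerance and starting a new phase otherwise. The function names and the tolerance parameter `tol` are assumptions for this example.

```python
import numpy as np

def fit_line(t, y):
    """Least-squares line fit; returns (slope, intercept, max abs residual)."""
    A = np.vstack([t, np.ones_like(t)]).T
    (m, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = np.abs(y - (m * t + b))
    return m, b, resid.max()

def detect_phases(t, y, tol=1.0):
    """Greedy segmentation into half-open index ranges [start, end):
    grow the current segment while one line fits within `tol`."""
    phases, start, end = [], 0, 2
    while end <= len(t):
        _, _, err = fit_line(t[start:end], y[start:end])
        if err > tol:
            phases.append((start, end - 1))  # close phase before the break
            start = end - 1
        end += 1
    phases.append((start, len(t)))
    return phases
```

For example, a metric that is flat for five samples and then grows linearly is split into two phases at the slope change. A real tool would of course work on sampled hardware-counter data rather than a synthetic series.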
Citations: 3
Victim Selection and Distributed Work Stealing Performance: A Case Study
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.74
Swann Perarnau, M. Sato
Work stealing is a popular solution for dynamic load balancing of irregular computations, on both shared memory and distributed memory systems. While the shared memory performance of work stealing is well understood, distributing this algorithm across several thousands of nodes can introduce new performance issues. In particular, most studies of work stealing assume that all participating processes are equidistant from each other in terms of communication latency. This paper presents a new performance evaluation of the popular UTS benchmark, in its work stealing implementation, at the scale of tens of thousands of compute nodes. Taking advantage of the physical scale of the K Computer, we investigate in detail the performance impact of communication latencies on work stealing. In particular, we introduce a new performance metric to assess the time needed by the work stealing scheduler to distribute work among all processes. Using this metric, we identify a previously overlooked issue: the victim selection function used by the work stealing application can severely impact its performance at large scale. To solve this issue, we introduce a new strategy that takes the physical distance between nodes into account, achieving significant performance improvements.
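A distance-aware victim selection function can be sketched as follows. This is a hypothetical illustration of the general idea, not the paper's strategy: candidates are drawn at random with weights that decay with network hop distance, so nearby ranks are stolen from more often, while `locality_bias = 0` recovers the usual uniform random selection. The ring topology and parameter names are assumptions for this example.

```python
import random

def ring_hops(a, b, n):
    """Hop distance between ranks a and b on an n-node ring (assumed topology)."""
    d = abs(a - b)
    return min(d, n - d)

def pick_victim(me, n, hops, locality_bias=2.0, rng=random):
    """Pick a steal victim at random, weighting closer ranks higher
    (weight = hops ** -locality_bias)."""
    candidates = [r for r in range(n) if r != me]
    weights = [hops(me, r) ** -locality_bias for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Example: on a 16-node ring, rank 0 steals mostly from its neighbours.
rng = random.Random(0)
n = 16
hops = lambda a, b: ring_hops(a, b, n)
counts = {r: 0 for r in range(n)}
for _ in range(2000):
    counts[pick_victim(0, n, hops, rng=rng)] += 1
```

The trade-off this exposes is the one the paper studies: pure locality keeps steal latency low but can slow global work diffusion, so a real scheduler would tune the bias (or mix in occasional far steals).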
Citations: 17
Improved Time Bounds for Linearizable Implementations of Abstract Data Types
Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.77
Jiaqi Wang, Edward Talmage, Hyunyoung Lee, J. Welch
Linearizability is a well-known consistency condition for shared objects in concurrent systems. We focus on the problem of implementing linearizable objects of arbitrary data types in message-passing systems with bounded, but uncertain, message delay and bounded, but non-zero, clock skew. We present an algorithm that exploits axiomatic properties of different operations to reduce the running time of each operation below that obtainable with previously known algorithms. We also prove lower bounds on the time complexity of various kinds of operations, specified by the axioms they satisfy, resulting in reduced gaps in some cases and tight bounds in others.
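The setting can be made concrete with a toy check of the classic delay-based technique (not the paper's algorithm): if every operation is timestamped with its invoker's local clock and responds only after waiting the message-delay bound plus the skew bound, then timestamp order respects real-time precedence, so ordering operations by timestamp yields a legal linearization. `D` and `EPS` are illustrative values for the delay and skew bounds.

```python
import random

D, EPS = 1.0, 0.1      # assumed bounds: max message delay, max clock skew
rng = random.Random(42)

def simulate(num_procs=4, ops_per_proc=50):
    """Each operation is timestamped with its invoker's skewed local clock
    and responds only after waiting D + EPS; returns (inv, resp, ts) triples."""
    offset = [rng.uniform(0, EPS) for _ in range(num_procs)]  # clock skew
    ops = []
    for p in range(num_procs):
        now = rng.uniform(0, 1)
        for _ in range(ops_per_proc):
            inv = now
            ts = inv + offset[p]       # local-clock timestamp
            resp = inv + D + EPS       # delayed response
            ops.append((inv, resp, ts))
            now = resp + rng.uniform(0, 1)
    return ops

ops = simulate()
# If a finishes (in real time) before b starts, a's timestamp is smaller:
# ts(a) <= inv(a) + EPS < inv(a) + D + EPS = resp(a) < inv(b) <= ts(b).
for a in ops:
    for b in ops:
        if a[1] < b[0]:
            assert a[2] < b[2]
```

The paper's contribution is precisely to beat this uniform worst-case delay by exploiting axiomatic properties of individual operations, so some operations can respond faster than `D + EPS`.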
Citations: 9
Journal
2014 IEEE 28th International Parallel and Distributed Processing Symposium