{"title":"A Unifying Programming Model for Parallel Graph Algorithms","authors":"Jeremiah Willcock, A. Lumsdaine","doi":"10.1109/IPDPSW.2015.79","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.79","url":null,"abstract":"Abstractions and programming models simplify the writing of programs by providing a clear mental framework for reasoning about problem domains and for isolating program expression from irrelevant implementation details. This paper focuses on the domain of graph algorithms, where there are several classes of details that we would like to hide from the programmer, including execution model, granularity of decomposition, and data representation. Most current systems expose some or all of these issues at the same level as their graph abstractions, constraining portability and extensibility while also negatively impacting programmer productivity. To address these challenges, this paper presents a unifying generalized SIMD-like programming model (and corresponding C++ implementation) that can be used to uniformly express graph and other irregular applications on a wide range of types of parallelism, decompositions, and data representations. With respect to these issues, we develop a detailed analysis of our approach and compare it to a number of popular alternatives.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132229176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Constrained Static Energy Reduction Using Way-Sharing Target-Banks","authors":"Shounak Chakraborty, Shirshendu Das, H. Kapoor","doi":"10.1109/IPDPSW.2015.49","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.49","url":null,"abstract":"Most chip-multiprocessors share a common, large last-level cache (LLC). In non-uniform cache access (NUCA) architectures, the LLC is divided into multiple banks that can be accessed independently. It has been observed that a principal share of chip power in a CMP is consumed by the LLC banks, and this consumption has two major components: dynamic and static. Techniques have been proposed to reduce the static power consumption of the LLC by powering off under-utilized banks and forwarding their requests to other active banks (target banks). Once a bank is powered off, all future requests arriving at its controller are forwarded to its target bank. Such a bank shutdown process saves static power but reduces LLC performance, and when multiple banks are shut down, the target banks may become overloaded. Additionally, request forwarding increases on-chip traffic. In this paper, we improve the performance of the target banks by dynamically managing their associativity. The cost of request forwarding is optimized by considering network distance as an additional metric for target selection. These two strategies together reduce performance degradation. Experimental analysis shows a 43% reduction in static energy and a 23% reduction in EDP for a 4MB LLC under a performance constraint of 3%.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133876551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directive-Based Auto-Tuning for the Finite Difference Method on the Xeon Phi","authors":"T. Katagiri, S. Ohshima, M. Matsumoto","doi":"10.1109/IPDPSW.2015.11","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.11","url":null,"abstract":"In this paper, we present a directive-based auto-tuning (AT) framework, called ppOpen-AT, and demonstrate its effect using simulation code based on the Finite Difference Method (FDM). The framework utilizes well-known loop transformation techniques; however, the codes used are carefully designed to minimize the software stack in order to meet the requirements of a many-core architecture currently in operation. The results of evaluations conducted using ppOpen-AT indicate that maximum speedup factors greater than 550% are obtained when it is applied on eight nodes of the Intel Xeon Phi. Further, with AT for data packing and unpacking, a 49% speedup for the whole application is achieved. Using it with strong scaling on 32 nodes of a Xeon Phi cluster, we also obtain a 24% speedup for the overall execution.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"315 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133346632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Burrows Wheeler Compression Using All-Cores","authors":"A. Deshpande, P J Narayanan","doi":"10.1109/IPDPSW.2015.53","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.53","url":null,"abstract":"In this paper, we present an all-core implementation of the Burrows-Wheeler Compression (BWC) algorithm that exploits all computing resources on a system. Our focus is to provide significant benefit to everyday users on common end-to-end applications by exploiting the parallelism of multiple CPU cores and additional accelerators, viz. a many-core GPU, on their machines. The all-core framework is suitable for problems that process large files or buffers in blocks. We treat a system as a set of compute stations and use a work queue to dynamically divide tasks among them. Each compute station uses an implementation that optimally exploits its architecture. We develop a fast GPU BWC algorithm by extending the state-of-the-art GPU string sort to efficiently perform the BWT step of BWC. Our hybrid BWC with GPU acceleration achieves a 2.9× speedup over the best CPU implementation. Our all-core framework allows concurrent processing of blocks by the GPU and all available CPU cores. We achieve a 3.06× speedup using all CPU cores, and a 4.87× speedup when we additionally use an accelerator, i.e., the GPU. Our approach scales to the number and types of computing resources or accelerators found on a system.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116402240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Job Scheduling in the Cloud Using Slowdown Optimization and Sandpile Cellular Automata Model","authors":"Jakub Gasior, F. Seredyński","doi":"10.1109/IPDPSW.2015.139","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.139","url":null,"abstract":"We present in this paper a general framework for studying effective load balancing and scheduling in highly parallel and distributed environments such as current Cloud computing systems. We propose a novel approach based on the concept of the Sandpile cellular automaton: a decentralized multi-agent system working in a critical state at the edge of chaos. Our goal is to provide fairness between concurrent job submissions by minimizing the slowdown of individual applications and dynamically rescheduling them to the best-suited resources. The algorithm design is validated by a number of numerical experiments showing the effectiveness and scalability of the scheme in the presence of a large number of jobs and resources, and its ability to react to dynamic changes in real time.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116629401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-objective Evolutionary Algorithm for Cloud Platform Reconfiguration","authors":"Francois Legillon, N. Melab, Didier Renard, E. Talbi","doi":"10.1109/IPDPSW.2015.138","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.138","url":null,"abstract":"Offers from public IaaS providers often change: new providers enter the market, and existing ones change their pricing or improve their offerings. Deciding whether and how to improve already-deployed platforms, either by reconfiguration or by migration to another provider, can be seen as an NP-hard optimization problem. In this paper, we define a new realistic model for this migration problem, based on a multi-objective optimization formulation. An evolutionary approach with problem-specific operators is introduced to tackle it. Experiments conducted on multiple realistic data sets show that the evolutionary approach can handle real-size instances in a reasonable amount of time.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116632446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PCO Introduction and Committees","authors":"D. E. Baz, B. Uçar","doi":"10.1109/IPDPSW.2015.169","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.169","url":null,"abstract":"PCO Introduction and Committees","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121907241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation","authors":"Jian Lin, Khaled Hamidouche, Xiaoyi Lu, Mingzhe Li, D. Panda","doi":"10.1109/IPDPSW.2015.115","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.115","url":null,"abstract":"Coarray Fortran (CAF) is a parallel programming paradigm that extends Fortran with the partitioned global address space (PGAS) programming model at the language level. Current CAF runtime implementations mainly use MPI or GASNet as the underlying communication component. MVAPICH2-X is a hybrid MPI+PGAS programming library with a Unified Communication Runtime (UCR) design. In this paper, the classic CAF runtime implementation in OpenUH is redesigned and rebuilt on top of MVAPICH2-X. The proposed design not only enables support for the hybrid MPI+CAF programming model, but also provides superior performance on most CAF one-sided operations and on the collective operations newly proposed in the Fortran 2015 specification. A comprehensive evaluation with different benchmarks and applications has been performed. Compared with current GASNet-based solutions, the CAF runtime with MVAPICH2-X improves the bandwidth of put and bidirectional operations by up to 3.5X for inter-node communication, and improves the bandwidth of collective communication operations, represented by broadcast, by up to 3.0X on 64 processes. It also reduces the execution time of the NPB CAF benchmarks by up to 18% on 256 processes.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128792390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AsHES Introduction and Committees","authors":"Yunquan Zhang","doi":"10.1109/IPDPSW.2014.217","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.217","url":null,"abstract":"AsHES Introduction and Committees","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123384653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}