Introducing Tetra: An Educational Parallel Programming System
Ian Finlayson, Jerome Mueller, S. Rajapakse, Daniel Easterling
doi:10.1109/IPDPSW.2015.51
Although we are firmly in the multicore era, parallel programming is not as widespread as it could be, either in the software industry or in education. There have been many calls to incorporate more parallel programming content into undergraduate computer science education. One obstacle is that the languages most commonly used for parallel programming are detailed, low-level languages such as C, C++, and Fortran (with OpenMP or MPI), OpenCL, and CUDA. These languages let programmers write very efficient code, but that matters little to those whose goal is to learn the concepts of parallel computing. This paper introduces Tetra, a parallel programming language that provides parallelism as a first-class language feature, offers garbage collection, and is designed to be as simple as possible. Tetra also includes an integrated development environment specifically geared toward debugging parallel programs and visualizing program execution across multiple threads.
Auto-tuning Non-blocking Collective Communication Operations
Youcef Barigou, V. Venkatesan, E. Gabriel
doi:10.1109/IPDPSW.2015.15
Collective operations are widely used in large-scale scientific applications and are critical to the scalability of these applications at large process counts. It has also been demonstrated that collective operations must be carefully tuned for a given platform and application scenario to maximize their performance. Non-blocking collective operations extend the concept of collective operations by offering the additional benefit of overlapping communication and computation. This paper presents the automatic run-time tuning of non-blocking collective communication operations, which allows the communication library to choose the best-performing implementation of a non-blocking collective operation on a case-by-case basis. The paper demonstrates that libraries using a single algorithm or implementation for a non-blocking collective operation inevitably deliver suboptimal performance in many scenarios, validating the necessity of run-time tuning for these operations. The benefits of the approach are further demonstrated for an application kernel using a multi-dimensional Fast Fourier Transform. The results obtained for the application scenario indicate a performance improvement of up to 40% compared to the current state of the art.
Streamlining Whole Function Vectorization in C Using Higher Order Vector Semantics
Gil Rapaport, A. Zaks, Y. Ben-Asher
doi:10.1109/IPDPSW.2015.37
Taking full advantage of SIMD instructions in C programs still requires tedious and non-portable programming with intrinsics, despite the considerable effort spent developing auto-vectorization capabilities in recent decades. Whole Function Vectorization (WFV) is a recent technique for extending the use of SIMD across entire functions. WFV has so far been used only in data-parallel languages such as OpenCL and ISPC. We propose a vector-oriented programming framework that facilitates WFV directly in C. We show that our framework achieves performance competitive with OpenCL and ISPC while maintaining C's original syntax and semantics. This allows C programmers to gain better performance for their applications by improving SIMD utilization, without stepping out of C.
Performance Portable Applications for Hardware Accelerators: Lessons Learned from SPEC ACCEL
G. Juckeland, Alexander Grund, W. Nagel
doi:10.1109/IPDPSW.2015.26
The popular and diverse hardware accelerator ecosystem makes apples-to-apples comparisons between platforms rather difficult. SPEC ACCEL offers a yardstick for comparing different accelerator hardware and software ecosystems. This paper uses the SPEC benchmark to compare an AMD GPU, an NVIDIA GPU, and an Intel Xeon Phi with respect to performance and energy consumption. It also provides observations on performance portability between the different platforms. Since the SPEC ACCEL OpenACC suite cannot yet be run on a Xeon Phi, that suite was ported to OpenMP 4.0 target directives to enable the comparison. The challenges and solutions involved in porting the 15 applications are described as well.
Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms
E. Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, J. Herrmann, Suraj Kumar, L. Marchal, Samuel Thibault
doi:10.1109/IPDPSW.2015.35
We consider the problem of allocating and scheduling dense linear algebra applications on fully heterogeneous platforms made of CPUs and GPUs. More specifically, we focus on the Cholesky factorization, since it exhibits the main features of such problems. Indeed, the relative performance of CPUs and GPUs depends strongly on the sub-routine: GPUs are, for instance, much more efficient at processing regular kernels such as matrix-matrix multiplication than at more irregular kernels such as matrix factorization. In this context, one solution is to rely on dynamic scheduling and resource allocation mechanisms such as those provided by PaRSEC or StarPU. In this paper we analyze the performance of dynamic schedulers based on both actual executions and simulations, and we investigate how adding static rules, derived from an offline analysis of the problem, to their decision process can improve their performance, up to reaching improved theoretical performance bounds which we introduce.
A Crossbar Interconnection Network in DNA
B. Talawar
doi:10.1109/IPDPSW.2015.103
DNA computers provide exciting challenges and opportunities in the fields of computer architecture, neural networks, autonomous micromechanical devices, and chemical reaction networks. The advent of digital abstractions such as seesaw gates holds many opportunities for computer architects to realize complex digital circuits using DNA strand displacement principles. This paper presents a realization of a single-bit 2×2 crossbar interconnection network built from seesaw gates. The functional correctness of the implemented crossbar was verified using a chemical reaction simulator.
Parallel Asynchronous Modified Newton Methods for Network Flows
D. E. Baz, M. Elkihel
doi:10.1109/IPDPSW.2015.34
We consider single-commodity strictly convex network flow problems. The dual problem is unconstrained, differentiable, and well suited to solution via parallel iterative methods. We propose parallel asynchronous modified Newton algorithms for solving the dual problem and prove their convergence. Parallel asynchronous Newton multisplitting algorithms are also considered, and their convergence is shown as well. A first set of computational results is presented and analyzed.
Cache Support in a High Performance Fault-Tolerant Distributed Storage System for Cloud and Big Data
L. Lundberg, Håkan Grahn, D. Ilie, C. Melander
doi:10.1109/IPDPSW.2015.65
Driven by the trends toward Big Data and Cloud Computing, one would like to provide large storage systems that are accessible to many servers. Shared storage can, however, become a performance bottleneck and a single point of failure. Distributed storage systems provide shared storage to the outside world, but internally they consist of a network of servers and disks, thus avoiding both the performance bottleneck and the single point of failure. We introduce a cache into a distributed storage system. The cache must be fault tolerant so that no data is lost in case of a hardware failure; this requirement excludes the common write-invalidate cache consistency protocols. The cache is implemented and evaluated in two steps. The first step focuses on design decisions that improve performance when only one server uses a given file. In the second step we extend the cache with features addressing the case where more than one server accesses the same file. The cache improves throughput significantly compared to having no cache, and the two-step evaluation approach makes it possible to quantify how different design decisions affect the performance of different use cases.
{"title":"HPBC Introduction and Committees","authors":"E. Aubanel, V. Bhavsar, M. Frumkin","doi":"10.1109/IPDPSW.2015.162","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.162","url":null,"abstract":"HPBC Introduction and Committees","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115957842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Combined Grouping and Aggregation Algorithm for Fast Query Processing in Columnar Databases with GPUs
S. Meraji, John Keenleyside, Sunil Kamath, Bob Blainey
doi:10.1109/IPDPSW.2015.21
Column-store in-memory databases have received a lot of attention because of their fast query-processing response times on modern multi-core machines. Among database operations, group-by/aggregate is an important and potentially costly operation, and sort-based and hash-based algorithms are the most common ways of processing group-by/aggregate queries. While sort-based algorithms are used in traditional database management systems (DBMSs), hash-based algorithms enable faster query processing in newer columnar databases. Moreover, graphics processing units (GPUs) can be used as fast, high-bandwidth co-processors to improve the query-processing performance of columnar databases. The focus of this article is a prototype for group-by/aggregate operations that we created to exploit GPUs. We present several hash-based algorithms that improve the performance of group-by/aggregate operations on the GPU; their performance depends on parameters such as the number of groups and the choice of hashing algorithm. We show up to a 7.6x improvement in kernel performance over a multi-core CPU implementation when using a partitioned multi-level hash algorithm that exploits both GPU shared and global memory.