2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW): Latest Articles

Parallel Isolation-Aggregation algorithms to solve Markov chains problems with application to page ranking

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470779

A. Touzene

In this paper, we propose two parallel Aggregation-Isolation iterative methods for solving Markov chains. These parallel methods conserve, as much as possible, the benefits of aggregation and of Gauss-Seidel effects. Experiments were conducted on models from queuing systems and on models from Google page ranking. The results show super-linear speed-up for the parallel Aggregation-Isolation method.
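
For readers unfamiliar with the Gauss-Seidel effect mentioned above, the following is a minimal sequential sketch of a plain Gauss-Seidel sweep for the stationary distribution of a small Markov chain. It is not the authors' Aggregation-Isolation method, which the abstract does not specify; the function name `stationary_gauss_seidel`, the tolerance, and the 3-state example matrix are assumptions made for illustration. The sketch also assumes a chain with no absorbing states, and convergence is not guaranteed for every chain.

```python
def stationary_gauss_seidel(P, tol=1e-12, max_sweeps=10_000):
    """Approximate pi with pi = pi * P and sum(pi) = 1 by Gauss-Seidel sweeps.

    P is a row-stochastic matrix given as a list of rows.
    Assumes P[i][i] < 1 for every state (hypothetical helper, not from the paper).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_sweeps):
        old = pi[:]
        for i in range(n):
            # Gauss-Seidel: pi[j] for j < i already holds this sweep's new value.
            inflow = sum(pi[j] * P[j][i] for j in range(n) if j != i)
            pi[i] = inflow / (1.0 - P[i][i])
        total = sum(pi)
        pi = [x / total for x in pi]  # renormalise so the entries sum to 1
        if max(abs(a - b) for a, b in zip(pi, old)) < tol:
            break
    return pi


# Small birth-death chain used only as an illustration.
P_EXAMPLE = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi_example = stationary_gauss_seidel(P_EXAMPLE)
```

Because each component is updated from values already refreshed in the same sweep, the sweep is inherently sequential, which is the property a parallel method has to work to preserve.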

Citations: 0

Evaluating database-oriented replication schemes in Software Transactional Memory systems

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470866

R. Palmieri, F. Quaglia, P. Romano, N. Carvalho

Software Transactional Memories (STMs) are emerging as a highly attractive programming model, thanks to their ability to mask concurrency management issues from the overlying applications. In this paper we are interested in the dependability of STM systems via replication. In particular, we present an extensive simulation study aimed at assessing the efficiency of some recently proposed database-oriented replication schemes when employed in the context of STM systems. Our results point out the limited efficiency and scalability of these schemes, highlighting the need for redesigned, ad hoc solutions that fit the requirements of STM environments. Possible directions for the redesign process are also discussed and supported by some early quantitative data.

Citations: 22

Hardware implementation for scalable lookahead Regular Expression detection

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470750

M. Bando, N. S. Artan, Nishit Mehta, Yiping Guan, H. J. Chao

Regular Expressions (RegExes) are widely used in various applications to identify strings of text. Their flexibility, however, increases the complexity of the detection system and often limits the detection speed as well as the total number of RegExes that can be detected using limited resources. The two classical detection methods, Deterministic Finite Automaton (DFA) and Non-Deterministic Finite Automaton (NFA), have the potential problems of prohibitively large memory requirements and a large number of concurrent operations, respectively. Although recent schemes that address these problems to improve DFA and NFA are promising, they are inherently limited in scalability, since they follow the state transition model of DFA and NFA, in which a state transition occurs for each character of the input. We recently proposed a scalable RegEx detection system called Lookahead Finite Automata (LaFA) to solve these problems with three novel ideas: (1) providing specialized and optimized detection modules to increase resource utilization; (2) systematically reordering the RegEx detection sequence to reduce the number of concurrent operations; and (3) sharing states among automata for different RegExes to reduce resource requirements. In this paper, we propose an efficient hardware architecture and prototype design implementation based on LaFA. Our proof-of-concept prototype design is built on a fraction of a single commodity Field Programmable Gate Array (FPGA) chip and can accommodate up to twenty-five thousand (25k) RegExes. Using only 7% of the logic area and 25% of the memory on a Xilinx Virtex-4 FX100, the prototype design can achieve 2-Gbps (gigabits-per-second) detection throughput with only one detection engine. We estimate that 34-Gbps detection throughput can be achieved if the entire resources of a state-of-the-art FPGA chip are used to implement multiple detection engines.
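
The per-character state transition model that the abstract identifies as the scalability limit of DFA-based detection can be sketched in software as follows. This is only an illustration of a classical table-driven DFA, not of LaFA or of the hardware design; the function `dfa_matches`, the table `AB_STAR_C` (a hand-built automaton for the pattern `ab*c`), and the whole-string matching semantics are assumptions chosen for brevity.

```python
def dfa_matches(transitions, start, accepting, text):
    """Run a table-driven DFA over text, one state transition per character.

    transitions maps (state, character) to the next state; a missing entry
    means the automaton rejects. Returns True only if the whole text is accepted.
    """
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hypothetical hand-built DFA for the pattern ab*c:
# state 0 = start, state 1 = seen 'a' and any number of 'b', state 2 = accepted.
AB_STAR_C = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}

results = [dfa_matches(AB_STAR_C, 0, {2}, s) for s in ("abbbc", "ac", "abca", "")]
```

Every input character costs exactly one table lookup here, so throughput is tied to the character rate and memory grows with the number of states, which is the trade-off the abstract describes.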

Citations: 3

Power assignment and transmission scheduling in wireless networks

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470777

Keqin Li

The problem of downlink data transmission scheduling in wireless networks is studied. It is pointed out that every downlink data transmission scheduling algorithm must have two components to solve the two subproblems of power assignment and transmission scheduling. Two types of downlink data transmission scheduling algorithms are proposed. In the first type, power assignment is performed before transmission scheduling. In the second type, power assignment is performed after transmission scheduling. The performance of two algorithms of the first type, which use the equal power allocation method, is analyzed. It is shown that both algorithms exhibit excellent worst-case performance and asymptotically optimal average-case performance under the condition that the total transmission power is equally allocated to the channels. In general, both algorithms exhibit excellent average-case performance. It is demonstrated that two algorithms of the second type perform better than the two algorithms of the first type due to the equal time power allocation method. Furthermore, the performance of our algorithms is very close to the optimal, and the room for further performance improvement is very limited.

Citations: 3

Scalable verification of MPI programs

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470683

Anh Vo, G. Gopalakrishnan

Large message-passing programs today are being deployed on clusters with hundreds, if not thousands, of processors. Any programming bugs that occur are very hard to debug and greatly affect productivity. Although many tools aim to help developers debug MPI programs, many of them fail to catch bugs that are caused by non-determinism in MPI codes. In this work, we propose a distributed, scalable framework that can explore all relevant schedules of MPI programs to check for deadlocks, resource leaks, local assertion errors, and other common MPI bugs.

Citations: 9

Fault tolerant linear algebra: Recovering from fail-stop failures without checkpointing

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470775

T. Davies, Zizhong Chen

Today's long-running high-performance computing applications typically tolerate fail-stop failures by checkpointing. While checkpointing is a very general technique and can be applied in a wide range of applications, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints. In this research, we will design highly scalable, low-overhead fault-tolerant schemes according to the specific characteristics of an application. We will focus on linear algebra operations and redesign selected algorithms to tolerate fail-stop failures without checkpointing. We will also incorporate the developed techniques into the widely used numerical linear algebra library package ScaLAPACK.

Citations: 5

A distributed diffusive heuristic for clustering a virtual P2P supercomputer

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470922

Joachim Gehweiler, Henning Meyerhenke

For the management of a virtual P2P supercomputer, one is interested in subgroups of processors that can communicate with each other efficiently. The task of finding these subgroups can be formulated as a graph clustering problem, where clusters are vertex subsets that are densely connected within themselves but sparsely connected to each other. Due to resource constraints, clustering using global knowledge (i.e., knowing (nearly) the whole input graph) might not be permissible in a P2P scenario, e.g., because collecting the data is not possible or would consume a large amount of resources. That is why we present a distributed heuristic that uses only limited local knowledge for clustering static and dynamic graphs. Based on disturbed diffusion, our algorithm DIDIC implicitly optimizes cut-related quality measures such as modularity. It thus sits between distributed clustering algorithms for other quality measures (e.g., energy efficiency in the field of ad hoc networking) and graph clustering algorithms that optimize cut-related measures with global knowledge. Our experiments with graphs resembling a virtual P2P supercomputer show the promising potential of the new approach: although each node starts with a random cluster number, may communicate only with its direct neighbors within the graph, and requires only a small amount of additional memory space, the solutions computed by DIDIC converge to clusterings that are comparable in quality to those computed by the established non-distributed graph clustering library mcl, whose main algorithm uses global knowledge.
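
As a concrete reading of the modularity measure named above, here is a short sketch that scores a given clustering of an undirected, unweighted graph using the standard definition Q = Σ_c [e_c / m − (d_c / 2m)²], where m is the number of edges, e_c the number of edges inside cluster c, and d_c the total degree of cluster c. It evaluates a clustering with global knowledge and is not the DIDIC heuristic; the function name `modularity`, the edge-list input format, and the example graph are assumptions, and self-loops are assumed absent.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity of a clustering of an undirected, unweighted graph.

    edges is a list of (u, v) pairs without self-loops; cluster_of maps each
    vertex to its cluster label. Hypothetical helper for illustration only.
    """
    m = len(edges)
    intra = defaultdict(int)       # e_c: edges with both endpoints in cluster c
    degree_sum = defaultdict(int)  # d_c: sum of vertex degrees in cluster c
    for u, v in edges:
        degree_sum[cluster_of[u]] += 1
        degree_sum[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            intra[cluster_of[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single edge, used only as an illustration.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_cluster = {v: "A" for v in range(6)}

q_split = modularity(EDGES, two_clusters)
q_merged = modularity(EDGES, one_cluster)
```

Splitting the graph along the single bridging edge scores higher than putting all vertices in one cluster, which matches the intuition of dense clusters with sparse connections between them.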

Citations: 35

CAP-OS: Operating system for runtime scheduling, task mapping and resource management on reconfigurable multiprocessor architectures

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470732

D. Göhringer, M. Hübner, Etienne Nguepi Zeutebouo, J. Becker

Operating systems traditionally handle the task scheduling of one or more application instances on a processor-like hardware architecture. Novel runtime-adaptive hardware exploits dynamic reconfiguration on FPGAs, where hardware blocks are generated, started and terminated. This is similar to software tasks in well-established operating system approaches. The hardware counterparts to the software tasks have to be transferred to the reconfigurable hardware via a configuration access port. This port enables the allocation of hardware blocks on the FPGA. Current reconfigurable hardware, such as the Xilinx Virtex 5, provides two internal configuration access ports (ICAPs), of which only one can be accessed at any point in time. In a multiprocessor system on an FPGA, for example, multiple instances may try to access these ports simultaneously. To prevent conflicts, access to these ports, as well as hardware resource management, needs to be controlled by a special-purpose operating system running on an embedded processor. This special-purpose operating system, called CAP-OS (Configuration Access Port-Operating System), is presented in this paper. It supports the clients that use the configuration port with the services of priority-based access scheduling, hardware task mapping and resource management.

Citations: 26

Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470854

R. Graham, S. Poole, Pavel Shamis, Gil Bloch, N. Bloch, H. Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, G. Shainer

This paper explores the computation and communication overlap enabled by the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface Card (NIC) ConnectX-2. We use the latency-dominated nonblocking barrier algorithm in this study and find that, at a process count of 64, a contiguous time slot of about 80% of the nonblocking barrier time is available for computation. This time slot increases as the number of participating processes increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise and, when nonblocking collective operations are used, may also hide the effects of application load imbalance.

Citations: 45

Solving the advection PDE on the cell broadband engine

Pub Date : 2010-04-19 DOI: 10.1109/IPDPSW.2010.5470761

G. Rokos, G. Peteinatos, Georgia Kouveli, G. Goumas, K. Kourtis, N. Koziris

In this paper we describe porting two different algorithms for solving the two-dimensional advection PDE to the CBE platform, an in-place and an out-of-place one, and compare their computational performance, completion time and code productivity. Study of the advection equation reveals data dependencies that limit performance and lead to inefficient scaling on parallel architectures. We explore programming techniques and optimizations that maximize performance for these solver versions. The out-of-place version is straightforward to implement and achieves greater raw performance than the in-place one, but requires more computational steps to converge. In both cases, achieving high computational performance relies heavily on manual source code optimization, because the compiler is unable to perform data vectorization and efficient instruction scheduling. The latter proves to be a key factor in the pursuit of high GFLOPS measurements.
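
The difference between in-place and out-of-place updates, and the data dependency the in-place form introduces, can be illustrated with a deliberately simplified sketch. The abstract does not give the paper's discretization, so everything here is an assumption: a one-dimensional first-order upwind stencil, a fixed inflow value in cell 0, a Courant number `c`, and the function names `step_out_of_place` and `step_in_place`. It is not the two-dimensional solver or the Cell-specific code discussed in the paper.

```python
def step_out_of_place(u, c):
    """One upwind update written to a fresh list; every read sees old values."""
    new = u[:]  # cell 0 keeps the inflow boundary value
    for i in range(1, len(u)):
        new[i] = u[i] - c * (u[i] - u[i - 1])
    return new


def step_in_place(u, c):
    """Same stencil, overwriting u in ascending order.

    When cell i is computed, cell i - 1 has already been overwritten, so each
    update depends on the one before it (a loop-carried dependency).
    """
    for i in range(1, len(u)):
        u[i] = u[i] - c * (u[i] - u[i - 1])
    return u


initial = [1.0, 0.0, 0.0, 0.0]
out_of_place_result = step_out_of_place(initial[:], 0.5)
in_place_result = step_in_place(initial[:], 0.5)
```

In the out-of-place form all cells can be updated independently, at the cost of a second array; in the in-place form information travels further in a single sweep, but the chain of dependencies makes the loop harder to vectorize or split across cores.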

Citations: 1
