Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470779
A. Touzene
In this paper, we propose two parallel Aggregation-Isolation iterative methods for solving Markov chains. These parallel methods conserves as much as possible the benefits of aggregation, and Gauss-Seidel effects. Some experiments have been conducted testing models from queuing systems and models from Google Page Ranking. The results of the experiments show super linear speed-up for the parallel Aggregation-Isolation method.
{"title":"Parallel Isolation-Aggregation algorithms to solve Markov chains problems with application to page ranking","authors":"A. Touzene","doi":"10.1109/IPDPSW.2010.5470779","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470779","url":null,"abstract":"In this paper, we propose two parallel Aggregation-Isolation iterative methods for solving Markov chains. These parallel methods conserves as much as possible the benefits of aggregation, and Gauss-Seidel effects. Some experiments have been conducted testing models from queuing systems and models from Google Page Ranking. The results of the experiments show super linear speed-up for the parallel Aggregation-Isolation method.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123265298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470866
R. Palmieri, F. Quaglia, P. Romano, N. Carvalho
Software Transactional Memories (STMs) are emerging as a highly attractive programming model, thanks to their ability to mask concurrency management issues to the overlying applications. In this paper we are interested in dependability of STM systems via replication. In particular we present an extensive simulation study aimed at assessing the efficiency of some recently proposed database-oriented replication schemes, when employed in the context of STM systems. Our results point out the limited efficiency and scalability of these schemes, highlighting the need for redesigning ad-hoc solutions well fitting the requirements of STM environments. Possible directions for the re-design process are also discussed and supported by some early quantitative data.
{"title":"Evaluating database-oriented replication schemes in Software Transactional Memory systems","authors":"R. Palmieri, F. Quaglia, P. Romano, N. Carvalho","doi":"10.1109/IPDPSW.2010.5470866","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470866","url":null,"abstract":"Software Transactional Memories (STMs) are emerging as a highly attractive programming model, thanks to their ability to mask concurrency management issues to the overlying applications. In this paper we are interested in dependability of STM systems via replication. In particular we present an extensive simulation study aimed at assessing the efficiency of some recently proposed database-oriented replication schemes, when employed in the context of STM systems. Our results point out the limited efficiency and scalability of these schemes, highlighting the need for redesigning ad-hoc solutions well fitting the requirements of STM environments. Possible directions for the re-design process are also discussed and supported by some early quantitative data.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122199133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470750
M. Bando, N. S. Artan, Nishit Mehta, Yiping Guan, H. J. Chao
Regular Expressions (RegExes) are widely used in various applications to identify strings of text. Their flexibility, however, increases the complexity of the detection system and often limits the detection speed as well as the total number of RegExes that can be detected using limited resources. The two classical detection methods, Deterministic Finite Automaton (DFA) and Non-Deterministic Finite Automaton (NFA), have the potential problems of prohibitively large memory requirements and a large number of concurrent operations, respectively. Although recent schemes addressing these problems to improve DFA and NFA are promising, they are inherently limited by their scalability, since they follow the state transition model in DFA and NFA, where the state transitions occur per each character of the input. We recently proposed a scalable RegEx detection system called Lookahead Finite Automata (LaFA) to solve these problems with three novel ideas: 1. Provide specialized and optimized detection modules to increase resource utilizations. 2. Systematically reordering the RegEx detection sequence to reduce number of concurrent operations. 3. Sharing states among automata for different RegExes to reduce resource requirements. In this paper, we propose an efficient hardware architecture and prototype design implementation based on LaFA. Our proof-of-concept prototype design is built on a fraction of a single commodity Field Programmable Gate Array (FPGA) chip and can accommodate up to twenty-five thousand (25k) RegExes. Using only 7% of the logic area and 25% of the memory on a Xilinx Virtex-4 FX100, the prototype design can achieve 2-Gbps (gigabits-per-second) detection throughput with only one detection engine. We estimate that 34-Gbps detection throughput can be achieved if the entire resources of a state-of-the-art FPGA chip are used to implement multiple detection engines.
{"title":"Hardware implementation for scalable lookahead Regular Expression detection","authors":"M. Bando, N. S. Artan, Nishit Mehta, Yiping Guan, H. J. Chao","doi":"10.1109/IPDPSW.2010.5470750","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470750","url":null,"abstract":"Regular Expressions (RegExes) are widely used in various applications to identify strings of text. Their flexibility, however, increases the complexity of the detection system and often limits the detection speed as well as the total number of RegExes that can be detected using limited resources. The two classical detection methods, Deterministic Finite Automaton (DFA) and Non-Deterministic Finite Automaton (NFA), have the potential problems of prohibitively large memory requirements and a large number of concurrent operations, respectively. Although recent schemes addressing these problems to improve DFA and NFA are promising, they are inherently limited by their scalability, since they follow the state transition model in DFA and NFA, where the state transitions occur per each character of the input. We recently proposed a scalable RegEx detection system called Lookahead Finite Automata (LaFA) to solve these problems with three novel ideas: 1. Provide specialized and optimized detection modules to increase resource utilizations. 2. Systematically reordering the RegEx detection sequence to reduce number of concurrent operations. 3. Sharing states among automata for different RegExes to reduce resource requirements. In this paper, we propose an efficient hardware architecture and prototype design implementation based on LaFA. Our proof-of-concept prototype design is built on a fraction of a single commodity Field Programmable Gate Array (FPGA) chip and can accommodate up to twenty-five thousand (25k) RegExes. Using only 7% of the logic area and 25% of the memory on a Xilinx Virtex-4 FX100, the prototype design can achieve 2-Gbps (gigabits-per-second) detection throughput with only one detection engine. We estimate that 34-Gbps detection throughput can be achieved if the entire resources of a state-of-the-art FPGA chip are used to implement multiple detection engines.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122230069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470777
Keqin Li
The problem of downlink data transmission scheduling in wireless networks is studied. It is pointed out that every downlink data transmission scheduling algorithm must have two components to solve the two subproblems of power assignment and transmission scheduling. Two types of downlink data transmission scheduling algorithms are proposed. In the first type, power assignment is performed before transmission scheduling. In the second type, power assignment is performed after transmission scheduling. The performance of two algorithms of the first type which use the equal power allocation method are analyzed. It is shown that both algorithms exhibit excellent worst-case performance and asymptotically optimal average-case performance under the condition that the total transmission power is equally allocated to the channels. In general, both algorithms exhibit excellent average-case performance. It is demonstrated that two algorithms of the second type perform better than the two algorithms of the first type due to the equal time power allocation method. Furthermore, the performance of our algorithms are very close to the optimal and the room for further performance improvement is very limited.
{"title":"Power assignment and transmission scheduling in wireless networks","authors":"Keqin Li","doi":"10.1109/IPDPSW.2010.5470777","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470777","url":null,"abstract":"The problem of downlink data transmission scheduling in wireless networks is studied. It is pointed out that every downlink data transmission scheduling algorithm must have two components to solve the two subproblems of power assignment and transmission scheduling. Two types of downlink data transmission scheduling algorithms are proposed. In the first type, power assignment is performed before transmission scheduling. In the second type, power assignment is performed after transmission scheduling. The performance of two algorithms of the first type which use the equal power allocation method are analyzed. It is shown that both algorithms exhibit excellent worst-case performance and asymptotically optimal average-case performance under the condition that the total transmission power is equally allocated to the channels. In general, both algorithms exhibit excellent average-case performance. It is demonstrated that two algorithms of the second type perform better than the two algorithms of the first type due to the equal time power allocation method. Furthermore, the performance of our algorithms are very close to the optimal and the room for further performance improvement is very limited.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125712713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470683
Anh Vo, G. Gopalakrishnan
Large message passing programs today are being deployed on clusters with hundreds, if not thousands of processors. Any programming bugs that happen will be very hard to debug and greatly affect productivity. Although there have been many tools aiming at helping developers debug MPI programs, many of them fail to catch bugs that are caused by non-determinism in MPI codes. In this work, we propose a distributed, scalable framework that can explore all relevant schedules of MPI programs to check for deadlocks, resource leaks, local assertion errors, and other common MPI bugs.
{"title":"Scalable verification of MPI programs","authors":"Anh Vo, G. Gopalakrishnan","doi":"10.1109/IPDPSW.2010.5470683","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470683","url":null,"abstract":"Large message passing programs today are being deployed on clusters with hundreds, if not thousands of processors. Any programming bugs that happen will be very hard to debug and greatly affect productivity. Although there have been many tools aiming at helping developers debug MPI programs, many of them fail to catch bugs that are caused by non-determinism in MPI codes. In this work, we propose a distributed, scalable framework that can explore all relevant schedules of MPI programs to check for deadlocks, resource leaks, local assertion errors, and other common MPI bugs.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115267317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470775
T. Davies, Zizhong Chen
Today's long running high performance computing applications typically tolerate fail-stop failures by checkpointing. While checkpointing is a very general technique and can be applied in a wide range of applications, it often introduces a considerable overhead especially when applications modify a large amount of memory between checkpoints. In this research, we will design highly scalable low overhead fault tolerant schemes according to the specific characteristics of an application. We will focus on linear algebra operations and re-design selected algorithms to tolerate fail-stop failures without checkpointing. We will also incorporate the developed techniques into the widely used numerical linear algebra library package ScaLAPACK.
{"title":"Fault tolerant linear algebra: Recovering from fail-stop failures without checkpointing","authors":"T. Davies, Zizhong Chen","doi":"10.1109/IPDPSW.2010.5470775","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470775","url":null,"abstract":"Today's long running high performance computing applications typically tolerate fail-stop failures by checkpointing. While checkpointing is a very general technique and can be applied in a wide range of applications, it often introduces a considerable overhead especially when applications modify a large amount of memory between checkpoints. In this research, we will design highly scalable low overhead fault tolerant schemes according to the specific characteristics of an application. We will focus on linear algebra operations and re-design selected algorithms to tolerate fail-stop failures without checkpointing. We will also incorporate the developed techniques into the widely used numerical linear algebra library package ScaLAPACK.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123830396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470922
Joachim Gehweiler, Henning Meyerhenke
For the management of a virtual P2P supercomputer one is interested in subgroups of processors that can communicate with each other efficiently. The task of finding these subgroups can be formulated as a graph clustering problem, where clusters are vertex subsets that are densely connected within themselves, but sparsely connected to each other. Due to resource constraints, clustering using global knowledge (i. e., knowing (nearly) the whole input graph) might not be permissible in a P2P scenario, e. g., because collecting the data is not possible or would consume a high amount of resources. That is why we present a distributed heuristic using only limited local knowledge for clustering static and dynamic graphs. Based on disturbed diffusion, our algorithm DIDIC implicitly optimizes cut-related quality measures such as modularity. It thus settles between distributed clustering algorithms for other quality measures (e. g., energy efficiency in the field of ad-hoc-networking) and graph clustering algorithms optimizing cut-related measures with global knowledge. Our experiments with graphs resembling a virtual P2P supercomputer show the promising potential of the new approach: Although each node starts with a random cluster number, may communicate only with its direct neighbors within the graph, and requires only a small amount of additional memory space, the solutions computed by DIDIC converge to clusterings that are comparable in quality to those computed by the established non-distributed graph clustering library mcl, whose main algorithm uses global knowledge.
{"title":"A distributed diffusive heuristic for clustering a virtual P2P supercomputer","authors":"Joachim Gehweiler, Henning Meyerhenke","doi":"10.1109/IPDPSW.2010.5470922","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470922","url":null,"abstract":"For the management of a virtual P2P supercomputer one is interested in subgroups of processors that can communicate with each other efficiently. The task of finding these subgroups can be formulated as a graph clustering problem, where clusters are vertex subsets that are densely connected within themselves, but sparsely connected to each other. Due to resource constraints, clustering using global knowledge (i. e., knowing (nearly) the whole input graph) might not be permissible in a P2P scenario, e. g., because collecting the data is not possible or would consume a high amount of resources. That is why we present a distributed heuristic using only limited local knowledge for clustering static and dynamic graphs. Based on disturbed diffusion, our algorithm DIDIC implicitly optimizes cut-related quality measures such as modularity. It thus settles between distributed clustering algorithms for other quality measures (e. g., energy efficiency in the field of ad-hoc-networking) and graph clustering algorithms optimizing cut-related measures with global knowledge. Our experiments with graphs resembling a virtual P2P supercomputer show the promising potential of the new approach: Although each node starts with a random cluster number, may communicate only with its direct neighbors within the graph, and requires only a small amount of additional memory space, the solutions computed by DIDIC converge to clusterings that are comparable in quality to those computed by the established non-distributed graph clustering library mcl, whose main algorithm uses global knowledge.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116548158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470732
D. Göhringer, M. Hübner, Etienne Nguepi Zeutebouo, J. Becker
Operating systems traditionally handle the task scheduling of one or more application instances on a processor like hardware architecture. Novel runtime adaptive hardware exploits the dynamic reconfiguration on FPGAs, where hardware blocks are generated, started and terminated. This is similar to software tasks in well established operating system approaches. The hardware counterparts to the software tasks have to be transferred to the reconfigurable hardware via a configuration access port. This port enables the allocation of hardware blocks on the FPGA. Current reconfigurable hardware, like e.g. Xilinx Virtex 5 provide two internal configuration access ports (ICAPs), where only one of these ports can be accessed at one point of time. In e.g. a multiprocessor system on an FPGA, it can happen that multiple instances try to access these ports simultaneously. To prevent conflicts, the access to these ports as well as the hardware resource management needs to be controlled by a special purpose operating system running on an embedded processor. This special purpose operating system, called CAPOS (Configuration Access Port-Operating System), which will be presented in this paper, supports the clients using the configuration port with the service of priority-based access scheduling, hardware task mapping and resource management.
操作系统传统上处理处理器(如硬件架构)上的一个或多个应用程序实例的任务调度。新的运行时自适应硬件利用fpga上的动态重构,其中硬件块生成,启动和终止。这类似于完善的操作系统方法中的软件任务。必须通过配置访问端口将软件任务的硬件对应物转移到可重新配置的硬件。该接口用于FPGA上的硬件块分配。当前的可重构硬件,如Xilinx Virtex 5提供了两个内部配置访问端口(icap),其中只有一个端口可以在一个时间点被访问。例如,在FPGA上的多处理器系统中,可能会发生多个实例试图同时访问这些端口的情况。为了防止冲突,对这些端口的访问以及硬件资源管理需要由运行在嵌入式处理器上的专用操作系统来控制。本文将介绍一种特殊用途的操作系统CAPOS (Configuration Access port - operating system),它为使用配置端口的客户端提供基于优先级的访问调度、硬件任务映射和资源管理服务。
{"title":"CAP-OS: Operating system for runtime scheduling, task mapping and resource management on reconfigurable multiprocessor architectures","authors":"D. Göhringer, M. Hübner, Etienne Nguepi Zeutebouo, J. Becker","doi":"10.1109/IPDPSW.2010.5470732","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470732","url":null,"abstract":"Operating systems traditionally handle the task scheduling of one or more application instances on a processor like hardware architecture. Novel runtime adaptive hardware exploits the dynamic reconfiguration on FPGAs, where hardware blocks are generated, started and terminated. This is similar to software tasks in well established operating system approaches. The hardware counterparts to the software tasks have to be transferred to the reconfigurable hardware via a configuration access port. This port enables the allocation of hardware blocks on the FPGA. Current reconfigurable hardware, like e.g. Xilinx Virtex 5 provide two internal configuration access ports (ICAPs), where only one of these ports can be accessed at one point of time. In e.g. a multiprocessor system on an FPGA, it can happen that multiple instances try to access these ports simultaneously. To prevent conflicts, the access to these ports as well as the hardware resource management needs to be controlled by a special purpose operating system running on an embedded processor. This special purpose operating system, called CAPOS (Configuration Access Port-Operating System), which will be presented in this paper, supports the clients using the configuration port with the service of priority-based access scheduling, hardware task mapping and resource management.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121865317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470854
R. Graham, S. Poole, Pavel Shamis, Gil Bloch, N. Bloch, H. Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, G. Shainer
This paper explores the computation and communication overlap capabilities enabled by the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface Card (NIC) ConnectX-2. We use the latency dominated nonblocking barrier algorithm in this study, and find that at 64 process count, a contiguous time slot of about 80% of the nonblocking barrier time is available for computation. This time slot increases as the number of processes participating increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise, and when using non-blocking collective operations may also be used to hide the effects of application load imbalance.
{"title":"Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities","authors":"R. Graham, S. Poole, Pavel Shamis, Gil Bloch, N. Bloch, H. Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, G. Shainer","doi":"10.1109/IPDPSW.2010.5470854","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470854","url":null,"abstract":"This paper explores the computation and communication overlap capabilities enabled by the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface Card (NIC) ConnectX-2. We use the latency dominated nonblocking barrier algorithm in this study, and find that at 64 process count, a contiguous time slot of about 80% of the nonblocking barrier time is available for computation. This time slot increases as the number of processes participating increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise, and when using non-blocking collective operations may also be used to hide the effects of application load imbalance.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121237932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2010-04-19DOI: 10.1109/IPDPSW.2010.5470761
G. Rokos, G. Peteinatos, Georgia Kouveli, G. Goumas, K. Kourtis, N. Koziris
In this paper we present the venture of porting two different algorithms for solving the two-dimensional advection PDE on the CBE platform, an in-place and an out-of-place one, and compare their computational performance, completion time and code productivity. Study of the advection equation reveals data dependencies which lead to limited performance and inefficient scaling to parallel architectures. We explore programming techniques and optimizations which maximize performance for these solver versions. The out-of-place version is straightforward to implement and achieves greater raw performance than the in-place one, but requires more computational steps to converge. In both cases, achieving high computational performance relies heavily on manual source code optimization, due to compiler incapability to do data vectorization and efficient instruction scheduling. The latter proves to be a key factor in pursuit of high GFLOPS measurements.
{"title":"Solving the advection PDE on the cell broadband engine","authors":"G. Rokos, G. Peteinatos, Georgia Kouveli, G. Goumas, K. Kourtis, N. Koziris","doi":"10.1109/IPDPSW.2010.5470761","DOIUrl":"https://doi.org/10.1109/IPDPSW.2010.5470761","url":null,"abstract":"In this paper we present the venture of porting two different algorithms for solving the two-dimensional advection PDE on the CBE platform, an in-place and an out-of-place one, and compare their computational performance, completion time and code productivity. Study of the advection equation reveals data dependencies which lead to limited performance and inefficient scaling to parallel architectures. We explore programming techniques and optimizations which maximize performance for these solver versions. The out-of-place version is straightforward to implement and achieves greater raw performance than the in-place one, but requires more computational steps to converge. In both cases, achieving high computational performance relies heavily on manual source code optimization, due to compiler incapability to do data vectorization and efficient instruction scheduling. The latter proves to be a key factor in pursuit of high GFLOPS measurements.","PeriodicalId":329280,"journal":{"name":"2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133277007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}