GraphMMU: Memory Management Unit for Sparse Graph Accelerators
Nachiket Kapre, Han Jianglei, Andrew Bean, P. Moorthy, Siddhartha
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.101

Memory management units that use low-level AXI descriptor chains to hold irregular, graph-oriented access sequences can help improve the DRAM throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zed board, we explore and compare the memory throughputs achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare-metal code, (3) CPU-based control of FPGA-based AXI DMAs, and finally (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular traffic generated from sparse graph access patterns, we observe a performance penalty of almost 10x due to DRAM row activations when compared to cache-friendly sequential access. When using an AXI DMA engine configured in FPGA logic and programmed in AXI register mode from the CPU, we can improve DRAM performance by as much as 2.4x over naive random access on the CPU. In this mode, the host CPU triggers each DMA transfer by writing the appropriate control information into the internal registers of the DMA engine. We also encode the sparse graph access patterns as locally stored, BRAM-hosted AXI descriptor chains that drive the AXI DMA engines with minimal CPU involvement in Scatter-Gather mode. In this configuration, we deliver an additional 3x speedup, for a cumulative throughput improvement of 7x over a CPU-based approach that uses caches and runs an OS to manage irregular access.

Buffer Allocation Based On-Chip Memory Optimization for Many-Core Platforms
M. Odendahl, Andrés Goens, R. Leupers, G. Ascheid, T. Henriksson
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.67

The problem of finding an optimal allocation of logical data buffers to memory has emerged as a new research challenge due to the increasing complexity of applications and newly emerging Dynamic RAM (DRAM) interface technologies. This new opportunity of a large off-chip memory accessible with ample bandwidth makes it possible to significantly reduce the on-chip Static RAM (SRAM) and save production cost in future many-core platforms. We therefore propose changes to an existing approach that uniformly reduce the on-chip memory size for a given application. We additionally introduce a novel linear programming model that automatically derives all necessary on-chip memory sizes for a given application from an optimal allocation of its data buffers. An extension further reduces the required on-chip memory in multi-application scenarios. We conduct a case study to validate all our models and show the applicability of our approach.

HCW 2014 Keynote Talk
A. Grimshaw
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.156

Summary form only given. Funded by the US National Science Foundation, the Extreme Science and Engineering Discovery Environment (XSEDE) project seeks to provide "a single virtual system that scientists can use to interactively share computing resources, data and experience." The resources, owned by many different organizations and individuals in the US or abroad, may be at national centers, on campuses, in individual research labs, or at home. Heterogeneity pervades such an environment: there are heterogeneous processor architectures, node architectures, operating systems, load management systems, file systems, linkage libraries, MPI implementations and versions, authentication policies, authorization requirements, internet access policies and mechanisms, operational policies, tolerance for risk -- the list goes on and on. It is the role of the XSEDE architecture to provide a clean model for component-to-component interactions, the definition of the standard core components, and the architectural approach to the non-functional aspects, often called the "ilities". These interfaces and interaction patterns must be sufficient to implement the XSEDE use cases both today and into the future. We have followed the principles that Notkin and others espoused in the early 1990s. This talk describes the architectural features required to satisfy one of the most demanding use cases: executing workflows spanning XSEDE resources and campus-based resources. This use case highlights the obvious functional aspects of execution and data management, identity federation, and identity delegation, as well as more difficult-to-homogenize qualities such as local operational policies. We will begin with a discussion of the use case requirements, then examine how the architectural components are combined to realize the use case. We will then discuss some of the problems encountered along the way, both with the standards used and with the approach of a homogeneous virtual machine.

Computing the Pseudo-Inverse of a Graph's Laplacian Using GPUs
Nishant Saurabh, A. Varbanescu, Gyan Ranjan
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.125

Many applications in network analysis require the computation of the network's Laplacian pseudo-inverse -- e.g., topological centrality in social networks or estimating commute times in electrical networks. As large graphs become ubiquitous, the traditional approaches -- with quadratic or cubic complexity in the number of vertices -- do not scale. To alleviate this performance issue, a divide-and-conquer approach has recently been developed. In this work, we take one step further in improving the performance of computing the pseudo-inverse of the Laplacian through parallelization. Specifically, we propose a parallel, GPU-based version of this new divide-and-conquer method. Furthermore, we implement this solution in MATLAB, a native environment for such computations, recently enhanced with the ability to harness the computational capabilities of GPUs. We find that using GPUs through MATLAB, we achieve speed-ups of up to 320x compared with the sequential divide-and-conquer solution. We further compare this GPU-enabled version with three other parallel solutions: a parallel CPU implementation and a CUDA-based implementation of the divide-and-conquer algorithm, as well as a GPU-based implementation that uses cuBLAS to compute the pseudo-inverse in the traditional way. We find that the GPU-based implementation outperforms the parallel CPU version significantly. Furthermore, our results demonstrate that a single best GPU-based implementation does not exist: depending on the size and structure of the graph, the relative performance of the three GPU-based versions can differ significantly. We conclude that GPUs can be successfully used to improve the performance of computing the pseudo-inverse of a graph's Laplacian, but choosing the best-performing solution remains challenging due to the non-trivial correlation between the achieved performance and the characteristics of the input graph. Our future work attempts to expose and exploit this correlation.

Energy Consumption Reduction with DVFS for Message Passing Iterative Applications on Heterogeneous Architectures
Jean-Claude Charr, R. Couturier, Ahmed Fanfakh, Arnaud Giersch
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.44

Computing platforms are consuming more and more energy due to the increasing number of nodes composing them. Many techniques have been used to minimize the operating costs of these platforms; dynamic voltage and frequency scaling (DVFS) is one of them. It reduces the frequency of a CPU to lower its energy consumption. However, lowering the frequency of a CPU may increase the execution time of an application running on that processor. Therefore, the frequency that gives the best trade-off between energy consumption and performance must be selected. In this paper, a new online frequency-selecting algorithm for heterogeneous platforms (heterogeneous CPUs) is presented. For each node executing the message-passing iterative application, it selects the frequency that gives the best trade-off between energy saving and performance degradation. The algorithm has a small overhead and works without training or profiling. It uses a new energy model for message-passing iterative applications running on a heterogeneous platform. The proposed algorithm is evaluated on the SimGrid simulator while running the NAS parallel benchmarks. The experiments show that it reduces the energy consumption by up to 34% while limiting the performance degradation as much as possible. Finally, the algorithm is compared to an existing method; the comparison results show that it outperforms the latter, saving on average 4% more energy for the same performance.

On the Greenness of In-Situ and Post-Processing Visualization Pipelines
Vignesh Adhinarayanan, Wu-chun Feng, J. Woodring, D. Rogers, J. Ahrens
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.132

Post-processing visualization pipelines are traditionally used to gain insight from simulation data. However, changes to the system architecture for high-performance computing (HPC), dictated by the exascale goal, have limited the applicability of post-processing visualization. As an alternative, in-situ pipelines have been proposed to enhance the knowledge-discovery process via "real-time" visualization. Quantitative studies have already shown how in-situ visualization can improve performance and reduce storage needs at the cost of scientific exploration capabilities. However, to fully understand the trade-off space, a head-to-head comparison of the power and energy of the two types of visualization pipelines is necessary. Thus, in this work, we study the greenness (i.e., power, energy, and energy efficiency) of the in-situ and the post-processing visualization pipelines, using a proxy heat-transfer simulation as an example. For a realistic I/O load, the in-situ pipeline consumes 43% less energy than the post-processing pipeline. Contrary to expectations, our findings also show that only 9% of the total energy is saved by reducing off-chip data movement, while the rest of the savings comes from reducing the system idle time. This suggests an alternative set of optimization techniques for reducing the power consumption of the traditional post-processing pipeline.

Query Execution for RDF Data Using Structure Indexed Vertical Partitioning
Bhavik Shah, Trupti Padiya, Minal Bhise
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.143

The paper explores the use of various partitioning methods to store RDF data effectively, to meet the needs of extensively growing, highly interactive semantic web applications. It proposes a combinational approach of structure index partitioning and vertical partitioning, SIVP, and demonstrates its implementation. The paper presents five metrics to measure and analyze the performance of the SIVP store. SIVP is evaluated on the FOAF and SwetoDBLP datasets. The SIVP store shows an average gain of 34% over vertical partitioning on the FOAF dataset and an average gain of 26% over VP on the SwetoDBLP dataset. SIVP is better than vertical partitioning provided that the extra time it needs, which consists of lookup time and merge time, is compensated by a query frequency higher than the breakeven point for that query.

Incorporating PDC Modules Into Computer Science Courses at Jackson State University
A. Humos, Sungbum Hong, Jacqueline Jackson, Xuejun Liang, T. Pei, Bernard Aldrich
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.39

The Computer Science Department at Jackson State University (JSU) is updating its curriculum according to the new ABET guidelines. As part of this effort, the computer science faculty members have integrated modules of the NSF/IEEE-TCPP Curriculum Initiative on PDC (Parallel and Distributed Computing) into department-wide core and elective courses offered in fall 2014. These courses are: CSC 119 Object Oriented Programming (core), CSC 216 Computer Architecture and Organization (core), CSC 312 Advanced Computer Architecture (elective), CSC 325 Operating Systems (core), CSC 350 Organization of Programming Languages (core), and CSC 425 Parallel Computing (elective). The inclusion of the PDC modules was gradual and lightweight in the lower-level courses and more aggressive in the higher-level courses. CSC 119 Object Oriented Programming provided students with an early introduction to Java threads: how to create and use them. In CSC 216 Computer Architecture and Organization, students learned about GPUs and were asked to solve simple problems using CUDA. CSC 312 Advanced Computer Architecture covered instruction-level and processor-level parallelism. In CSC 325 Operating Systems, mutual exclusion problems as well as parallel computing and algorithms were introduced. In CSC 350 Organization of Programming Languages, students learned about the implementation of threads in Java. CSC 425 Parallel Computing is an advanced study of parallel computing hardware and software issues. Assessment results showed that student perception of PDC concepts was satisfactory, with some weakness in writing parallel code. However, students were very excited and motivated to learn about PDC. We were also able to share our experience with the Computer Engineering Department at JSU; new PDC modules will be integrated into some of their courses in the next fall and spring semesters. Our findings were made available on the Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER) website. In this paper, we describe our experience of incorporating PDC modules into the aforementioned computer science courses at JSU.

Iso-Power-Efficiency: An Approach to Scaling Application Codes with a Power Budget
R. Long, S. Moore, B. Rountree
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.122

We propose a new model for scaling applications with an increasing power budget, which we call the iso-power-efficiency function. We show that viewing scaling in this way has advantages over the previously proposed isoefficiency function, which assumes all processors run at maximum power. Our experimental results show that overprovisioning can result in better scaling under a power budget.

Efficient Message Logging to Support Process Replicas in a Volunteer Computing Environment
M. Islam, Hien Nguyen, J. Subhlok, E. Gabriel
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.91

The context of this research is Volpex, a communication framework based on Put/Get calls to an abstract global space that can seamlessly handle multiple active replicas of communicating processes. Volpex is designed for a heterogeneous and unreliable execution environment where parallel applications need replication as well as checkpointing to make continuous progress. Since different instances of the same process can execute in the same logical state at different clock times, communicated data objects must be logged to ensure consistent execution of process replicas. Logging to support redundancy can be the source of significant overhead in execution time and storage and can limit scalability. In this paper we develop, implement, and evaluate Log on Read and Log on Write logging schemes to support redundant communication. Log on Read schemes log a copy of the data object returned to every Get (or Read) request. Log on Write schemes, on the other hand, log the old data object only when a Put request overwrites it. This reduces redundant copying, but identifying the correct data object to return to a Get request becomes complex. A Virtual Time Stamp (VTS) that captures the global execution state is logged along with the data object to make this possible. We develop an optimized Log on Read scheme that minimizes redundancy and an optimized Log on Write scheme that reduces the VTS size and overhead. Experimental results show that the optimizations are effective in terms of storage and time overhead and that the optimized Log on Read scheme presents the best tradeoffs for most scenarios.