GraphMMU: Memory Management Unit for Sparse Graph Accelerators
Nachiket Kapre, Han Jianglei, Andrew Bean, P. Moorthy, Siddhartha
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.101

Memory management units that use low-level AXI descriptor chains to hold irregular, graph-oriented access sequences can help improve the DRAM throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zed board, we explore and compare the memory throughputs achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare-metal code, (3) CPU-based control of FPGA-based AXI DMAs, and finally (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular traffic generated from sparse graph access patterns, we observe a performance penalty of almost 10x due to DRAM row activations when compared to cache-friendly sequential access. When using an AXI DMA engine configured in FPGA logic and programmed in AXI register mode from the CPU, we can improve DRAM performance by as much as 2.4x over naive random access on the CPU. In this mode, the host CPU triggers each DMA transfer by writing the appropriate control information into the internal registers of the DMA engine. We also encode the sparse graph access patterns as locally stored, BRAM-hosted AXI descriptor chains that drive the AXI DMA engines with minimal CPU involvement in Scatter-Gather mode. In this configuration, we deliver an additional 3x speedup, for a cumulative throughput improvement of 7x over a CPU-based approach that uses caches and runs an OS to manage irregular access.

Buffer Allocation Based On-Chip Memory Optimization for Many-Core Platforms
M. Odendahl, Andrés Goens, R. Leupers, G. Ascheid, T. Henriksson
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.67

The problem of finding an optimal allocation of logical data buffers to memory has emerged as a new research challenge due to the increasing complexity of applications and newly emerging Dynamic RAM (DRAM) interface technologies. This new opportunity of a large off-chip memory accessible with ample bandwidth makes it possible to significantly reduce the on-chip Static RAM (SRAM) and save production cost in future many-core platforms. We therefore propose changes to an existing approach that uniformly reduce the on-chip memory size for a given application. We additionally introduce a novel linear programming model that automatically derives all necessary on-chip memory sizes for a given application from an optimal allocation of its data buffers. An extension further reduces the required on-chip memory in multi-application scenarios. We conduct a case study to validate all our models and show the applicability of our approach.

HCW 2014 Keynote Talk
A. Grimshaw
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.156

Summary form only given. Funded by the US National Science Foundation, the Extreme Science and Engineering Discovery Environment (XSEDE) project seeks to provide "a single virtual system that scientists can use to interactively share computing resources, data and experience." The resources, owned by many different organizations and individuals in the US or abroad, may be at national centers, on campuses, in individual research labs, or at home. Heterogeneity pervades such an environment: there are heterogeneous processor architectures, node architectures, operating systems, load management systems, file systems, linkage libraries, MPI implementations and versions, authentication policies, authorization requirements, internet access policies and mechanisms, operational policies, tolerance for risk -- the list goes on and on. It is the role of the XSEDE architecture to provide a clean model for component-to-component interactions, the definition of the standard core components, and the architectural approach to the non-functional aspects, often called the "ilities". These interfaces and interaction patterns must be sufficient to implement the XSEDE use cases both today and into the future. We have followed the principles that Notkin and others espoused in the early 1990s. This talk describes the architectural features required to satisfy one of the most demanding use cases: executing workflows spanning XSEDE resources and campus-based resources. This use case highlights the obvious functional aspects of execution and data management, identity federation, and identity delegation, as well as more difficult-to-homogenize qualities such as local operational policies. We will begin with a discussion of the use case requirements, then examine how the architectural components are combined to realize the use case. We will then discuss some of the problems encountered along the way, both with the standards used and with the approach of a homogeneous virtual machine.

Computing the Pseudo-Inverse of a Graph's Laplacian Using GPUs
Nishant Saurabh, A. Varbanescu, Gyan Ranjan
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.125

Many applications in network analysis require the computation of the network's Laplacian pseudo-inverse -- e.g., topological centrality in social networks or estimating commute times in electrical networks. As large graphs become ubiquitous, the traditional approaches -- with quadratic or cubic complexity in the number of vertices -- do not scale. To alleviate this performance issue, a divide-and-conquer approach has recently been developed. In this work, we take one step further in improving the performance of computing the pseudo-inverse of the Laplacian through parallelization. Specifically, we propose a parallel, GPU-based version of this new divide-and-conquer method. Furthermore, we implement this solution in MATLAB, a native environment for such computations, recently enhanced with the ability to harness the computational capabilities of GPUs. We find that using GPUs through MATLAB, we achieve speed-ups of up to 320x compared with the sequential divide-and-conquer solution. We further compare this GPU-enabled version with three other parallel solutions: a parallel CPU implementation and a CUDA-based implementation of the divide-and-conquer algorithm, as well as a GPU-based implementation that uses cuBLAS to compute the pseudo-inverse in the traditional way. We find that the GPU-based implementation outperforms the parallel CPU version significantly. Furthermore, our results demonstrate that a single best GPU-based implementation does not exist: depending on the size and structure of the graph, the relative performance of the three GPU-based versions can differ significantly. We conclude that GPUs can be successfully used to improve the performance of computing the pseudo-inverse of a graph's Laplacian, but choosing the best-performing solution remains challenging due to the non-trivial correlation between the achieved performance and the characteristics of the input graph. Our future work attempts to expose and exploit this correlation.

Energy Consumption Reduction with DVFS for Message Passing Iterative Applications on Heterogeneous Architectures
Jean-Claude Charr, R. Couturier, Ahmed Fanfakh, Arnaud Giersch
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.44

Computing platforms are consuming more and more energy due to the increasing number of nodes composing them. Many techniques have been used to minimize the operating costs of these platforms; dynamic voltage and frequency scaling (DVFS) is one of them. It reduces the frequency of a CPU to lower its energy consumption. However, lowering the frequency of a CPU may increase the execution time of an application running on that processor. Therefore, the frequency that gives the best trade-off between energy consumption and performance must be selected. In this paper, a new online frequency-selecting algorithm for heterogeneous platforms (heterogeneous CPUs) is presented. For each node executing the message-passing iterative application, it selects the frequency that gives the best trade-off between energy saving and performance degradation. The algorithm has a small overhead and works without training or profiling. It uses a new energy model for message-passing iterative applications running on a heterogeneous platform. The proposed algorithm is evaluated on the SimGrid simulator while running the NAS parallel benchmarks. The experiments show that it reduces the energy consumption by up to 34% while limiting the performance degradation as much as possible. Finally, the algorithm is compared to an existing method; the comparison results show that it outperforms the latter, saving on average 4% more energy for the same performance.

On the Greenness of In-Situ and Post-Processing Visualization Pipelines
Vignesh Adhinarayanan, Wu-chun Feng, J. Woodring, D. Rogers, J. Ahrens
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.132

Post-processing visualization pipelines are traditionally used to gain insight from simulation data. However, changes to the system architecture for high-performance computing (HPC), dictated by the exascale goal, have limited the applicability of post-processing visualization. As an alternative, in-situ pipelines have been proposed to enhance the knowledge-discovery process via "real-time" visualization. Quantitative studies have already shown how in-situ visualization can improve performance and reduce storage needs at the cost of scientific exploration capabilities. However, to fully understand the trade-off space, a head-to-head comparison of the power and energy of the two types of visualization pipelines is necessary. Thus, in this work, we study the greenness (i.e., power, energy, and energy efficiency) of the in-situ and the post-processing visualization pipelines, using a proxy heat-transfer simulation as an example. For a realistic I/O load, the in-situ pipeline consumes 43% less energy than the post-processing pipeline. Contrary to expectations, our findings also show that only 9% of the total energy is saved by reducing off-chip data movement, while the rest of the savings comes from reducing the system idle time. This suggests an alternative set of optimization techniques for reducing the power consumption of the traditional post-processing pipeline.

Query Execution for RDF Data Using Structure Indexed Vertical Partitioning
Bhavik Shah, Trupti Padiya, Minal Bhise
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.143

The paper explores the use of various partitioning methods to store RDF data effectively, to meet the needs of extensively growing, highly interactive semantic web applications. It proposes a combinational approach of structure index partitioning and vertical partitioning, SIVP, and demonstrates its implementation. The paper presents five metrics to measure and analyze the performance of the SIVP store. SIVP is evaluated on the FOAF and SwetoDBLP datasets. The SIVP store shows an average gain of 34% over vertical partitioning on the FOAF dataset and an average gain of 26% over VP on the SwetoDBLP dataset. SIVP is better than vertical partitioning provided that the extra time it needs, which consists of lookup time and merge time, is compensated by a query frequency higher than the breakeven point for that query.

Incorporating PDC Modules Into Computer Science Courses at Jackson State University
A. Humos, Sungbum Hong, Jacqueline Jackson, Xuejun Liang, T. Pei, Bernard Aldrich
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.39

The Computer Science Department at Jackson State University (JSU) is updating its curriculum according to the new ABET guidelines. As part of this effort, the computer science faculty members have integrated modules of the NSF/IEEE-TCPP Curriculum Initiative on PDC (Parallel and Distributed Computing) into department-wide core and elective courses offered in fall 2014. These courses are: CSC 119 Object Oriented Programming (core), CSC 216 Computer Architecture and Organization (core), CSC 312 Advanced Computer Architecture (elective), CSC 325 Operating Systems (core), CSC 350 Organization of Programming Languages (core), and CSC 425 Parallel Computing (elective). The inclusion of the PDC modules was gradual and lightweight in the lower-level courses and more aggressive in the higher-level courses. CSC 119 Object Oriented Programming provided students with an early introduction to Java threads: how to create and use them. In CSC 216 Computer Architecture and Organization, students learned about GPUs and were asked to solve simple problems using CUDA. CSC 312 Advanced Computer Architecture covered instruction-level and processor-level parallelism. In CSC 325 Operating Systems, mutual exclusion problems as well as parallel computing and algorithms were introduced. In CSC 350 Organization of Programming Languages, students learned about the implementation of threads in Java. CSC 425 Parallel Computing is an advanced study of parallel computing hardware and software issues. Assessment results showed that student perception of PDC concepts was satisfactory, with some weakness in writing parallel code. However, students were very excited and motivated to learn about PDC. We were also able to share our experience with the Computer Engineering Department at JSU; new PDC modules will be integrated into some of their courses in the next fall and spring semesters. Our findings were made available on the Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER) website. In this paper, we describe our experience of incorporating PDC modules into the aforementioned computer science courses at JSU.

Iso-Power-Efficiency: An Approach to Scaling Application Codes with a Power Budget
R. Long, S. Moore, B. Rountree
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.122

We propose a new model for scaling applications with an increasing power budget, which we call the iso-power-efficiency function. We show that viewing scaling in this way has advantages over the previously proposed isoefficiency function, which assumes all processors run at maximum power. Our experimental results show that overprovisioning can result in better scaling under a power budget.

Efficient Message Logging to Support Process Replicas in a Volunteer Computing Environment
M. Islam, Hien Nguyen, J. Subhlok, E. Gabriel
2015 IEEE International Parallel and Distributed Processing Symposium Workshop. DOI: 10.1109/IPDPSW.2015.91

The context of this research is Volpex, a communication framework based on Put/Get calls to an abstract global space that can seamlessly handle multiple active replicas of communicating processes. Volpex is designed for a heterogeneous and unreliable execution environment where parallel applications need replication as well as checkpointing to make continuous progress. Since different instances of the same process can execute in the same logical state at different clock times, communicated data objects must be logged to ensure consistent execution of process replicas. Logging to support redundancy can be the source of significant overhead in execution time and storage and can limit scalability. In this paper we develop, implement, and evaluate Log on Read and Log on Write logging schemes to support redundant communication. Log on Read schemes log a copy of the data object returned to every Get (or Read) request. Log on Write schemes, on the other hand, log the old data object only when a Put request overwrites it. This reduces redundant copying, but identifying the correct data object to return to a Get request becomes complex. A Virtual Time Stamp (VTS) that captures the global execution state is logged along with the data object to make this possible. We develop an optimized Log on Read scheme that minimizes redundancy and an optimized Log on Write scheme that reduces the VTS size and overhead. Experimental results show that the optimizations are effective in terms of storage and time overhead and that the optimized Log on Read scheme presents the best tradeoffs for most scenarios.