Many of the proposed algorithms for allocating processors to jobs in supercomputers choose arbitrarily among potential allocations that are "equally good" according to the allocation algorithm. In this paper, we add a parametrized tie-breaking strategy to the MC1x1 allocation algorithm for mesh supercomputers. This strategy attempts to favor allocations that preserve large regions of free processors, benefiting future allocations and improving machine performance. Trace-based simulations show the promise of our strategy; with good parameter choices, most jobs benefit and no class of jobs is harmed significantly.
"A Tie-Breaking Strategy for Processor Allocation in Meshes." Christopher R. Johnson, David P. Bunde, V. Leung. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.50
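The tie-breaking idea can be sketched in a few lines: among candidate allocations the allocator considers equally good, prefer the one that leaves the largest connected region of free processors. This is an illustrative sketch, not the actual MC1x1 code; the function names and the single-region scoring rule are assumptions.

```python
# Hypothetical sketch: score each candidate allocation by the size of the
# largest 4-connected free region it would leave behind, and pick the best.

def largest_free_region(grid):
    """Size of the largest 4-connected region of free (True) cells."""
    rows, cols = len(grid), len(grid[0])
    seen, best = set(), 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and (r, c) not in seen:
                stack, size = [(r, c)], 0
                seen.add((r, c))
                while stack:
                    cr, cc = stack.pop()
                    size += 1
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = cr + dr, cc + dc
                        if 0 <= nr < rows and 0 <= nc < cols \
                           and grid[nr][nc] and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
                best = max(best, size)
    return best

def break_tie(grid, candidates):
    """Pick the candidate allocation (a set of (row, col) cells) that
    preserves the largest connected free region after placement."""
    def score(cells):
        after = [[grid[r][c] and (r, c) not in cells
                  for c in range(len(grid[0]))] for r in range(len(grid))]
        return largest_free_region(after)
    return max(candidates, key=score)
```

On a free 1x4 row, for example, allocating the corner cell leaves a free region of 3, while allocating an interior cell splits the row into regions of size 1 and 2, so the corner wins the tie.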
Emerging high performance computing (HPC) applications must cope with scalability across an increasing number of nodes as well as the programming of special accelerator hardware. The hybrid composition of large computing systems adds a new dimension of complexity to software development. This paper presents a novel approach to gain insight into accelerator interaction and utilization without any changes to the application. It extends well-established performance analysis methods to accelerator hardware, allowing a holistic view of the performance bottlenecks of hybrid applications. A general strategy is presented for obtaining dynamic runtime information about hybrid program execution with minimal impact on the program flow. The achievable level of detail is studied using the CUDA environment and the OpenCL framework as examples. Combined with existing performance analysis techniques, this facilitates exploiting the full potential of hybrid computing power.
"Non-intrusive Performance Analysis of Parallel Hardware Accelerated Applications on Hybrid Architectures." R. Dietrich, T. Ilsche, G. Juckeland. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.30
The current trend towards multi-core/many-core and accelerated architectures presents challenges in both portability and in the choices developers must make about how to use the resources these architectures provide. This paper explores some of the possibilities enabled by the Open Computing Language (OpenCL), and proposes a programming model that allows developers and scientists to more fully subscribe hybrid compute nodes while, at the same time, reducing the impact of system failure.
"A Hybrid Programming Model for Compressible Gas Dynamics Using OpenCL." B. Bergen, Marcus G. Daniels, Paul M. Weber. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.60
Irregular scientific applications are difficult to parallelize in an efficient and scalable fashion due to indirect memory references (e.g., A[B[i]]), irregular communication patterns, and load balancing issues. In this paper, we present our experience parallelizing an irregular scientific application written in Java. The application is an N-body molecular dynamics simulation that is the main component of a Java application called the Molecular Workbench (MW). We parallelized MW to run on multicore hardware using Java's java.util.concurrent library. Speedup was found to vary greatly depending on which type of force computation dominated the simulation. To understand the cause of this appreciable difference in scalability, various performance analysis tools were deployed, including Intel's VTune, Apple's Shark, the Java Application Monitor (JaMON), and Sun's VisualVM. Virtual machine instrumentation as well as hardware performance monitors were used. To our knowledge this is the first such performance analysis of an irregular scientific application parallelized using Java threads. In the course of this investigation, a number of challenges were encountered, which generally stemmed from a mismatch between the nature of our application and either Java itself or the performance tools we used. This paper aims to share our real-world experience with Java threading and today's parallel performance tools in an effort to influence future directions for the Java virtual machine, for the Java concurrency library, and for tools for multicore parallel software development.
"Performance Evaluation of an Irregular Application Parallelized in Java." Christopher D. Krieger, M. Strout. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.40
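The indirect-update pattern that makes such codes hard to parallelize can be illustrated with a toy analogue (in Python here, not the paper's Java code): several workers accumulate contributions into A[B[i]], so two iterations may target the same element. One standard remedy, sketched below under that assumption, is to privatize a partial array per worker and reduce afterwards, avoiding locks on the shared array.

```python
# Toy analogue of the irregular A[B[i]] accumulation pattern: each worker
# writes into a private partial array, then the partials are reduced into
# the shared array A, so no two workers ever race on the same element.
from concurrent.futures import ThreadPoolExecutor

def accumulate(A, B, contrib, workers=4):
    n = len(A)
    # Deal iterations round-robin to workers; any partition works.
    chunks = [range(start, len(B), workers) for start in range(workers)]

    def worker(idx):
        partial = [0.0] * n          # private copy: no races on A
        for i in idx:
            partial[B[i]] += contrib[i]
        return partial

    with ThreadPoolExecutor(workers) as pool:
        for partial in pool.map(worker, chunks):
            for j in range(n):       # sequential reduction into A
                A[j] += partial[j]
    return A
```

The trade-off is the O(workers * n) reduction and the extra memory for partials, which is why speedup for such codes depends heavily on how dense and contended the indirect updates are.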
Large-scale Service Oriented Architecture (SOA) developments are becoming increasingly reliant on registry services that manage Web Services using taxonomic attributes. At present a registry stores a Web Service's interface definition and protocol bindings in WSDL, along with one or more XML schema files that define the structure of the SOAP messages exchanged between Web Service operations and client processes, and other static metadata. During Web Service discovery an ebXML registry returns the access URI associated with the service binding to allow dynamic discovery and invocation. This usually restricts a calling process to a Web Service invocation on one host. This work explores a mechanism to manage service bindings for a Web Service that has been deployed across multiple hosts, such that a URI returned by a registry can resolve to a host that satisfies system constraints such as current CPU load, physical memory, swap memory, and time of day. This paper discusses the design and development of a new scheme for ebXML registries that facilitates periodic collection and management of dynamic system properties for registry clients and enforces constraints during service discovery and query operations.
"A Load Balancing Scheme for ebXML Registries." Sadhana Sahasrabudhe, C. Paolini. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.12
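The constraint-based resolution step can be sketched as follows. This is an illustrative model, not the ebXML registry API: the function name, the dict fields, and the thresholds are all assumptions standing in for the periodically collected system properties the abstract describes.

```python
# Hypothetical sketch: resolve a logical service to the deployment host
# that satisfies the registry's dynamic constraints (CPU load, free
# memory), preferring the least-loaded eligible host.
def resolve_binding(hosts, max_load=0.75, min_free_mem_mb=512):
    """hosts: list of dicts with 'uri', 'cpu_load', 'free_mem_mb',
    refreshed periodically from each deployment host."""
    eligible = [h for h in hosts
                if h["cpu_load"] <= max_load
                and h["free_mem_mb"] >= min_free_mem_mb]
    if not eligible:
        raise LookupError("no host satisfies the constraints")
    # Among eligible hosts, return the URI of the least loaded one.
    return min(eligible, key=lambda h: h["cpu_load"])["uri"]
```

A time-of-day constraint, which the paper also mentions, would simply be one more predicate in the eligibility filter.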
P. Andrews, P. Kovatch, Victor Hazlewood, Troy Baer
In late 2009, the National Institute for Computational Sciences placed in production the world's fastest academic supercomputer (third overall), a Cray XT5 named Kraken, with almost 100,000 compute cores and a peak speed in excess of one Petaflop. Delivering over 50% of the total cycles available to the National Science Foundation users via the TeraGrid, Kraken has two missions that have historically proven difficult to simultaneously reconcile: providing the maximum number of total cycles to the community, while enabling full machine runs for "hero" users. Historically, this has been attempted by allowing schedulers to choose the correct time for the beginning of large jobs, with a concomitant reduction in utilization. At NICS, we used the results of a previous theoretical investigation to adopt a different approach, where the "clearing out" of the system is forced on a weekly basis, followed by consecutive full machine runs. As our previous simulation results suggested, this led to a significant improvement in utilization, to over 90%. The difference in utilization between the traditional and adopted scheduling policies was the equivalent of a 300+ Teraflop supercomputer, or several million dollars of compute time per year.
"Scheduling a 100,000 Core Supercomputer for Maximum Utilization and Capability." P. Andrews, P. Kovatch, Victor Hazlewood, Troy Baer. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.63
David Castells-Rufas, Jaume Joven, Sergi Risueño, Eduard Fernandez-Alonso, J. Carrabina, T. William, H. Mix
There is some consensus that the embedded and HPC domains have to create synergies to face the challenges of creating, maintaining and optimizing software for future many-core platforms. In this work we show how some HPC performance analysis methods can be successfully adapted to the embedded domain. We propose using virtual prototypes based on instruction set simulators (ISS) to produce trace files by transparent instrumentation that can be used for post-mortem performance analysis. Transparent instrumentation on an ISS kills two birds with one stone: it adds no overhead for trace generation and it solves the problem of trace storage. A virtual prototype is built to generate OTF traces that are later analyzed with Vampir. We show how performance analysis of the virtual prototype is valuable for optimizing a parallel embedded test application, allowing an acceptable speedup factor on 4 processors to be obtained.
"MPSoC Performance Analysis with Virtual Prototyping Platforms." David Castells-Rufas, Jaume Joven, Sergi Risueño, Eduard Fernandez-Alonso, J. Carrabina, T. William, H. Mix. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.32
In wireless sensor networks, collection of raw sensor data at a base station provides the flexibility to perform detailed offline analysis on the data, which may not be possible with in-network data aggregation. However, lossless data collection consumes a considerable amount of energy for communication, while sensors usually have limited energy. In this paper, we propose a Distributed and Energy-efficient algorithm for Collection Of Raw data in sensor networks called DECOR. DECOR exploits spatial correlation to reduce the communication energy in sensor networks with highly correlated data. In our approach, at each neighborhood, one sensor shares its raw data as a reference with the rest of the sensors without any suppression or compression. The other sensors use this reference data to compress their observations by representing them in the form of mutual differences. In a highly correlated network, transmission of reference data consumes significantly more energy than transmission of compressed data. Thus, we first attempt to minimize the number of reference transmissions, and then we try to minimize the size of the mutual differences. We derive analytical lower bounds for both phases and, based on our theoretical results, propose a two-step distributed data collection algorithm that reduces the communication energy significantly compared to existing methods. In addition, we modify our algorithm for lossy communication channels and evaluate its performance through simulation.
"A Distributed and Energy Efficient Algorithm for Data Collection in Sensor Networks." Sarah Sharafkandi, D. Du, Alireza Razavi. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.84
Data prefetching is an effective way to accelerate data access in high-end computing systems and to bridge the increasing performance gap between processor and memory. In recent years, context-based data prefetching has received intensive attention because of its general applicability. In this study, we provide a preliminary analysis of the impact of context order on the effectiveness of context-based prefetching. Motivated by observations from the analytical results, we propose a new context-based prefetching method named Multi-Order Context-based (MOC) prefetching, which adopts multi-order context analysis to increase prefetching effectiveness. We carried out simulation testing with the SPEC CPU2006 benchmarks via an enhanced CMP$im simulator. The simulation results show that the proposed MOC prefetching method outperforms existing single-order prefetching and reduces data-access latency effectively.
"Improving the Effectiveness of Context-Based Prefetching with Multi-order Analysis." Yong Chen, Huaiyu Zhu, Hui Jin, Xian-He Sun. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.64
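A single-order context predictor, the baseline MOC builds on, can be modeled in a few lines. This is a toy model, not the paper's implementation: an order-k predictor keys a table on the last k addresses and predicts the address that followed that context before; a multi-order scheme would consult several such predictors of different k and arbitrate among them.

```python
# Toy model of a single-order context-based prefetcher: learn which
# address followed each length-k context, and predict from the current
# context on every access.
class ContextPrefetcher:
    def __init__(self, order):
        self.order = order
        self.table = {}        # context tuple -> address that followed it
        self.history = []

    def access(self, addr):
        """Record an access; return a prefetch prediction, or None."""
        ctx = tuple(self.history[-self.order:])
        if len(ctx) == self.order:
            self.table[ctx] = addr      # learn: this context led to addr
        self.history.append(addr)
        new_ctx = tuple(self.history[-self.order:])
        if len(new_ctx) == self.order:
            return self.table.get(new_ctx)
        return None
```

The order trade-off the paper analyzes is visible even here: a larger k makes predictions more specific but leaves the table cold for longer, which motivates combining several orders.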
Miao Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, D. Panda
High End Computing (HEC) systems are being deployed with eight to sixteen compute cores per node, with 64 to 128 cores/node envisioned for exascale systems. MVAPICH2 is a popular implementation of MPI-2 specifically designed and optimized for InfiniBand, iWARP and RDMA over Converged Ethernet (RoCE). MVAPICH2 is based on MPICH2 from ANL. Recently MPICH2 has been redesigned in an effort to optimize intra-node communication for future many-core systems. The new communication layer in MPICH2, called Nemesis, is well optimized for shared memory message passing, with a modular design for various high-performance interconnects. In this paper we explore the challenges involved in designing the next-generation MVAPICH2 stack, leveraging the Nemesis communication layer. We observe that Nemesis does not provide abstractions for one-sided communication. We propose an extended Nemesis interface for optimized one-sided communication and provide design details. Our experimental evaluation shows that the proposed one-sided interface extensions provide significantly better performance than the basic Nemesis interface. For example, inter-node MPI_Put bandwidth increased from 1,800 MB/s to 3,000 MB/s and latency for small messages went down by 13%. Additionally, with our proposed designs, we demonstrate performance gains with small messages when compared to the existing MVAPICH2 CH3 implementation. The designs proposed in this paper are a superset of the options currently available to MVAPICH2 users and provide the best combination of performance and modularity.
"High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2." Miao Luo, S. Potluri, P. Lai, E. Mancini, H. Subramoni, K. Kandalla, S. Sur, D. Panda. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.58