2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

Publisher's Information
Pub Date: 2019-05-01 DOI: 10.1109/ipdps.2019.00118
{"title":"Publisher's Information","authors":"","doi":"10.1109/ipdps.2019.00118","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00118","url":null,"abstract":"","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128056956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Path to Delivering Programable Exascale Systems
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00081
L. DeRose
The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems create a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not losing sight of performance portability. In this talk, I will discuss the current trends in computer architecture and their implications for application development, and will present Cray's high-level parallel programming environment for performance and programmability on current and future supercomputers. I will also discuss some of the challenges and open research problems that need to be addressed in order to build a software developer environment for extreme-scale systems that helps users solve multi-disciplinary and multi-scale problems with high levels of performance, programmability, and scalability.
Citations: 1
A Bin-Based Bitstream Partitioning Approach for Parallel CABAC Decoding in Next Generation Video Coding
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00112
Philipp Habermann, C. C. Chi, M. Alvarez-Mesa, B. Juurlink
Context-based Adaptive Binary Arithmetic Coding (CABAC) is one of the main throughput bottlenecks in video decoding due to its sequential nature and the lack of data-level parallelism. High-level parallelization techniques can be used in most state-of-the-art video codecs, but they usually require a full replication of the decoding hardware and decrease the coding efficiency. We present a Bin-based Bitstream Partitioning (B3P) scheme to enable additional thread-level parallelism in CABAC decoding. Binary symbols are distributed over eight bitstream partitions that can be decoded simultaneously. The implementation and evaluation are based on the High Efficiency Video Coding Standard (HEVC/H.265). Significant speedups up to 8.5x are achieved for CABAC decoding while only 9.2% extra cell area is required and the bitstream overhead remains below 1% for high bitrates. The B3P hardware decoder can process up to 3.94 Gbins/s. Compared to state-of-the-art related work, we achieve higher throughput with slightly lower hardware cost and similar coding efficiency.
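To make the partitioning scheme concrete, the following minimal sketch (an illustration only, not the authors' hardware or software decoder) assigns each of the eight bitstream partitions to its own thread; `PartitionDecoder` and its bit-unpacking loop are placeholders for a real context-adaptive arithmetic decoding engine.

```cpp
#include <array>
#include <cstdint>
#include <thread>
#include <vector>

// Placeholder for a per-partition bin decoder. In B3P, bins are routed to
// one of eight partitions, so each partition's decode loop depends only on
// its own bitstream and can run independently of the others.
struct PartitionDecoder {
    std::vector<uint8_t> bitstream;  // compressed bins for this partition
    std::vector<uint8_t> bins;       // decoded binary symbols

    void decode() {
        // Stand-in for the arithmetic-decoding loop; a real CABAC engine
        // updates context models and renormalizes its range per bin.
        bins.reserve(bitstream.size() * 8);
        for (uint8_t byte : bitstream)
            for (int b = 7; b >= 0; --b)
                bins.push_back((byte >> b) & 1);
    }
};

// Decode all eight partitions concurrently: the thread-level parallelism
// that the bitstream partitioning exposes.
void decode_b3p_frame(std::array<PartitionDecoder, 8>& partitions) {
    std::vector<std::thread> workers;
    for (auto& p : partitions)
        workers.emplace_back([&p] { p.decode(); });
    for (auto& w : workers)
        w.join();
}
```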
Citations: 2
Combining Prefetch Control and Cache Partitioning to Improve Multicore Performance
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00103
Gongjin Sun, Junjie Shen, A. Veidenbaum
Modern commercial multi-core processors are equipped with multiple hardware prefetchers on each core. The prefetchers can significantly improve application performance. However, contention for shared resources, such as the last-level cache (LLC), off-chip memory bandwidth, and the memory controller, can lead to prefetch interference. Multiple techniques have been proposed to reduce such interference and improve performance isolation across cores, such as coordinated control among prefetchers and cache partitioning (CP). Each of them has its advantages and disadvantages. This paper proposes combining these two techniques in a coordinated way. Prefetchers and the LLC are treated as separate resources, and a multi-resource management mechanism is proposed to control prefetching and cache partitioning. This control mechanism is implemented as a Linux kernel module and can be applied to a wide variety of prefetch architectures. An implementation on an Intel Xeon E5 v4 processor shows that combining LLC partitioning and prefetch throttling provides a significant improvement in performance and fairness.
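The paper's controller is a Linux kernel module; as a rough userspace analogue, the sketch below shows the two knobs such a mechanism coordinates on recent Intel CPUs: LLC way masks written through the resctrl filesystem, and per-core prefetcher disable bits written to MSR 0x1A4 via the msr driver. The group name, way mask, and bit assignments are illustrative assumptions; real deployments should consult the processor documentation.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <string>

// Restrict a resctrl group to a subset of LLC ways, e.g. mask "L3:0=00ff".
// Assumes the resctrl filesystem is mounted at /sys/fs/resctrl and that
// the group directory already exists.
bool set_llc_partition(const std::string& group, const std::string& mask) {
    std::string path = "/sys/fs/resctrl/" + group + "/schemata";
    FILE* f = std::fopen(path.c_str(), "w");
    if (!f) return false;
    std::fprintf(f, "%s\n", mask.c_str());
    return std::fclose(f) == 0;
}

// Throttle hardware prefetchers on one core by setting disable bits in
// MSR 0x1A4 (on many Intel cores: bit 0 = L2 HW prefetcher, bit 1 = L2
// adjacent-line prefetcher, bit 2 = DCU streamer, bit 3 = DCU IP).
bool throttle_prefetchers(int cpu, uint64_t disable_bits) {
    char path[64];
    std::snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0) return false;
    bool ok = pwrite(fd, &disable_bits, sizeof(disable_bits), 0x1A4) ==
              static_cast<ssize_t>(sizeof(disable_bits));
    close(fd);
    return ok;
}
```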
Citations: 10
Slate: Enabling Workload-Aware Efficient Multiprocessing for Modern GPGPUs
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00035
Tyler N. Allen, Xizhou Feng, Rong Ge
As GPUs now contribute the majority of computing power in HPC and data centers, improving GPU utilization has become an important research problem. Sharing a GPU among multiple kernels is an effective approach but requires judicious kernel selection and scheduling for optimal gains. In this paper, we present Slate, a software-based workload-aware GPU multiprocessing framework that enables concurrent kernels from different processes to share GPU devices. Slate selects concurrent kernels that have complementary resource demands at run time to minimize interference between individual kernels and improve GPU resource utilization. Slate adjusts the size of application kernels on the fly so that kernels readily share, release, and claim resources based on GPU status. It further controls overhead, including data transfers and synchronization. We have built a prototype of Slate and evaluated it on a system with an NVIDIA Titan Xp card. Our experiments show that Slate improves system throughput by 11% on average, and by up to 35% in the best case for the tested applications, in comparison to NVIDIA Multi-Process Service (MPS), which uses hardware scheduling and the leftover policy for resource sharing.
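As a hedged illustration of the pairing policy (the profile fields and the scoring function below are invented for exposition, not Slate's actual run-time model), one can score candidate co-runners by how little they collide on the resources each kernel stresses:

```cpp
#include <optional>
#include <vector>

// Hypothetical per-kernel demand profile; Slate derives its information
// at run time rather than from static fields like these.
struct KernelProfile {
    int id;
    double compute_util;  // fraction of SM compute capacity demanded
    double mem_util;      // fraction of memory bandwidth demanded
};

// Pick the waiting kernel that best complements the running one, i.e.
// minimizes the demand collision on compute and on memory bandwidth.
std::optional<int> pick_partner(const KernelProfile& running,
                                const std::vector<KernelProfile>& waiting) {
    std::optional<int> best;
    double best_score = 1e30;
    for (const auto& k : waiting) {
        double score = running.compute_util * k.compute_util +
                       running.mem_util * k.mem_util;
        if (score < best_score) {
            best_score = score;
            best = k.id;
        }
    }
    return best;  // empty if no kernel is waiting
}
```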
Citations: 15
An Architecture and Stochastic Method for Database Container Placement in the Edge-Fog-Cloud Continuum
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00050
Petar Kochovski, R. Sakellariou, M. Bajec, P. Drobintsev, V. Stankovski
Databases as software components may be used to serve a variety of smart applications. Currently, the Internet of Things (IoT), Artificial Intelligence (AI) and Cloud technologies are used in projects such as the Horizon 2020 EU-Korea DECENTER project in order to implement four smart applications in the domains of Smart Homes, Smart Cities, Smart Construction and Robot Logistics. In these smart applications, the Big Data pipeline starts from various sensor and video streams to which AI and feature extraction methods are applied. The resulting information is stored in database containers, which have to be placed on Edge, Fog or Cloud infrastructures. The placement decision depends on complex application requirements, including Quality of Service (QoS) requirements. Information that must be considered when making placement decisions includes the expected workload, the list of candidate infrastructures, geolocation, connectivity and the like. Software engineers currently make such decisions manually, which usually leads to QoS threshold violations. This paper aims to automate the process of making such decisions. Therefore, the goals of this paper are to: (1) develop a decision-making method for database container placement; (2) formally verify each placement decision and provide probability assurances to the software engineer for high QoS; and (3) design and implement a new architecture that automates the whole process. A new optimisation method is introduced, based on the theory and practice of stochastic Markov Decision Processes (MDP). It takes as input monitoring data from the container runtime, the expected workload and user-related metrics in order to automatically construct a probabilistic finite automaton. The generated automaton is used both for automated decision making and for verifying placement success. The method is implemented in Java and uses the PRISM model-checking tool. Kubernetes is used to automate the whole process when orchestrating database containers across Edge, Fog and Cloud infrastructures. Experiments are performed on NoSQL Cassandra database containers for three representative workloads of 50000 (workload 1), 200000 (workload 2) and 500000 (workload 3) CRUD database operations. Five computing infrastructures serve as candidates for database container placement. The new MDP-based method is compared with the widely used Analytic Hierarchy Process (AHP) method, and the obtained results are used to analyse container placement decisions. With the new MDP-based method there were no QoS violations in any of the placement cases, whereas the AHP-based method resulted in QoS threshold violations in all workload cases. Due to its properties, the new MDP method is particularly suitable for implementation. The paper also describes a multi-tier distributed computing system that uses multi-level (infrastructure, container, application) monitoring metrics and Kubernetes to orchestrate database containers across Edge, Fog and Cloud nodes; the architecture demonstrates fully automated decision making and high-QoS container operation.
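The authors implement their method in Java on top of the PRISM model checker; purely to illustrate the decision model underneath, here is a toy value-iteration solver for a placement MDP (the states, actions, discount factor, and reward structure are hypothetical; in practice the reward would encode expected QoS violations per infrastructure):

```cpp
#include <array>

// Toy placement MDP: states are candidate infrastructures (e.g. Edge,
// Fog, Cloud) and action a means "place/move the container to a".
constexpr int kStates = 3;
constexpr int kActions = 3;

// P[s][a][t] = probability of ending in state t after action a in state s.
using Transition =
    std::array<std::array<std::array<double, kStates>, kActions>, kStates>;
using Reward = std::array<std::array<double, kActions>, kStates>;

// Standard value iteration; returns the best action per state.
std::array<int, kStates> value_iteration(const Transition& P, const Reward& R,
                                         double gamma = 0.9, int iters = 200) {
    std::array<double, kStates> V{};    // state values, start at zero
    std::array<int, kStates> policy{};  // best action per state
    for (int it = 0; it < iters; ++it) {
        std::array<double, kStates> Vn{};
        for (int s = 0; s < kStates; ++s) {
            double best = -1e30;
            for (int a = 0; a < kActions; ++a) {
                double q = R[s][a];
                for (int t = 0; t < kStates; ++t)
                    q += gamma * P[s][a][t] * V[t];
                if (q > best) { best = q; policy[s] = a; }
            }
            Vn[s] = best;
        }
        V = Vn;
    }
    return policy;
}
```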
Citations: 17
[Copyright notice]
Pub Date: 2019-05-01 DOI: 10.1109/ipdps.2019.00003
{"title":"[Copyright notice]","authors":"","doi":"10.1109/ipdps.2019.00003","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00003","url":null,"abstract":"","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127947270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00034
Jie Zhang, Xiaoyi Lu, Ching-Hsiang Chu, D. Panda
In recent years, GPU-based platforms have achieved significant success for parallel applications. In addition to highly optimized computation kernels on GPUs, the cost of data movement on GPU clusters plays a critical role in delivering high performance for end applications. Many recent studies have proposed optimizations for GPU- or CUDA-aware communication runtimes, and these designs have been widely adopted in emerging GPU-based applications. These studies mainly focus on improving communication performance in native environments, i.e., physical machines; GPU-based communication schemes in cloud environments, however, are not yet well studied. This paper first investigates the performance characteristics of state-of-the-art GPU-based communication schemes in both native and container-based environments, which shows a significant need for high-performance container-aware communication schemes in GPU-enabled runtimes that deliver near-native performance for end applications on clouds. Next, we propose the C-GDR approach to designing high-performance Container-aware GPUDirect communication schemes on RDMA networks. C-GDR allows communication runtimes to detect process locality, GPU residency, NUMA and architecture information, and communication patterns to enable intelligent and dynamic selection of the best communication and data-movement schemes on GPU-enabled clouds. We have integrated C-GDR with the MVAPICH2 library. Our evaluations show that MVAPICH2 with C-GDR has clear performance benefits in container-based cloud environments compared to default MVAPICH2-GDR and Open MPI. For instance, our proposed C-GDR can outperform default MVAPICH2-GDR schemes by up to 66% on micro-benchmarks and up to 26% on HPC applications in a container-based environment.
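The following sketch shows the flavor of locality-driven scheme selection that C-GDR automates (the enum names and selection rules are illustrative assumptions, not MVAPICH2 internals): intra-node GPU-to-GPU transfers can use CUDA IPC, cross-container same-host traffic can stage through shared memory, and inter-node traffic can use GPUDirect RDMA.

```cpp
// Detected relationship between two communicating MPI processes.
enum class Placement { SameContainer, SameHostDifferentContainer, RemoteHost };

// Candidate data-movement paths for GPU buffers.
enum class GpuPath { CudaIpc, SharedMemoryStaging, GpuDirectRdma };

// Map locality and buffer residency to a movement scheme. A real runtime
// would also weigh NUMA placement, message size, and architecture details.
GpuPath select_scheme(Placement p, bool both_buffers_on_gpu) {
    if (p == Placement::RemoteHost)
        return GpuPath::GpuDirectRdma;    // inter-node path
    if (both_buffers_on_gpu)
        return GpuPath::CudaIpc;          // intra-node device-to-device
    return GpuPath::SharedMemoryStaging;  // host-mediated fallback
}
```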
Citations: 4
ParILUT - A Parallel Threshold ILU for GPUs
Pub Date: 2019-05-01 DOI: 10.1109/IPDPS.2019.00033
H. Anzt, T. Ribizel, Goran Flegar, Edmond Chow, J. Dongarra
In this paper, we present the first algorithm for computing threshold ILU factorizations on GPU architectures. The proposed ParILUT-GPU algorithm is based on interleaving parallel fixed-point iterations that approximate the incomplete factors for an existing nonzero pattern with a strategy that dynamically adapts the nonzero pattern to the problem characteristics. This requires the efficient selection of thresholds that separate the values to be dropped from the incomplete factors, and we design a novel selection algorithm tailored towards GPUs. All components of the ParILUT-GPU algorithm make heavy use of the features available in the latest NVIDIA GPU generations, and outperform existing multithreaded CPU implementations.
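The heart of threshold ILU is the drop step: choose a magnitude cutoff so that only a fixed number of candidate entries survive. The CPU sketch below shows that step in isolation using std::nth_element; it does not reproduce the paper's GPU-tailored selection algorithm.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Find the magnitude threshold such that at most `keep` entries have a
// larger magnitude; entries with |v| <= threshold would then be dropped
// from the incomplete factor.
double select_drop_threshold(std::vector<double> values, std::size_t keep) {
    if (keep >= values.size()) return 0.0;  // nothing needs to be dropped
    for (double& v : values) v = std::fabs(v);
    // Partially sort so the `keep` largest magnitudes come first.
    std::nth_element(values.begin(),
                     values.begin() + static_cast<std::ptrdiff_t>(keep),
                     values.end(), std::greater<double>());
    return values[keep];
}
```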
Citations: 14
[Title page iii]
Pub Date: 2019-05-01 DOI: 10.1109/ipdps.2019.00002
{"title":"[Title page iii]","authors":"","doi":"10.1109/ipdps.2019.00002","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00002","url":null,"abstract":"","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132508182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0