
Latest publications: 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)

HiPC 2020 Industry Sponsors
{"title":"HiPC 2020 Industry Sponsors","authors":"","doi":"10.1109/hipc50609.2020.00012","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00012","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114157800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Parallel and Scalable Framework for Insider Threat Detection
Abdoulaye Diop, N. Emad, Thierry Winter
In this article, we propose an innovative method for the detection of insider threats. The method is based on a unite-and-conquer approach for combining ensemble learning techniques, which has the distinctive property of being intrinsically parallel. Furthermore, it exhibits multi-level parallelism, offers fault tolerance, and is well suited to heterogeneous architectures. To demonstrate the approach's efficacy, we present a use case of insider threat detection on a parallel platform. The results of this experiment show the method's benefits in terms of improved classification AUC score and scalability.
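To make the pattern concrete, below is a minimal sketch of the intrinsically parallel ensemble idea the abstract describes: several detectors are trained independently (in parallel) and their anomaly scores are then united into a single decision. The threshold detectors, synthetic data, and averaging rule are illustrative stand-ins, not the paper's actual models or its unite-and-conquer implementation.

```python
# Minimal sketch: train ensemble members in parallel, then unite their scores.
from concurrent.futures import ProcessPoolExecutor
import random

def train_and_score(seed, train, test):
    """One ensemble member: a z-score detector fit on a bagging-style resample."""
    rng = random.Random(seed)
    sample = rng.sample(train, k=len(train) // 2)
    mu = sum(sample) / len(sample)
    var = sum((x - mu) ** 2 for x in sample) / len(sample)
    sigma = var ** 0.5 or 1.0
    return [abs(x - mu) / sigma for x in test]   # higher score = more anomalous

if __name__ == "__main__":
    rng = random.Random(0)
    normal = [rng.gauss(0.0, 1.0) for _ in range(1000)]      # "benign" activity
    test = [rng.gauss(0.0, 1.0) for _ in range(5)] + [8.0]   # last point: the insider
    with ProcessPoolExecutor() as pool:                       # members run in parallel
        futures = [pool.submit(train_and_score, s, normal, test) for s in range(8)]
        member_scores = [f.result() for f in futures]
    united = [sum(col) / len(col) for col in zip(*member_scores)]  # "unite" step
    print(united)   # the injected outlier receives a far larger combined score
```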
{"title":"A Parallel and Scalable Framework for Insider Threat Detection","authors":"Abdoulaye Diop, N. Emad, Thierry Winter","doi":"10.1109/HiPC50609.2020.00024","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00024","url":null,"abstract":"In this article, we propose an innovative method for the detection of insider threats. This method is based on a unite and conquer approach used to combine ensemble learning techniques, which have the particularity of being intrinsically parallel. Furthermore, it showcases multi-level parallelism properties, offers fault tolerance, and is suitable for heterogeneous architectures. To highlight our approach's efficacy, we present a use case of insider threat detection on a parallel platform. This experiment's results showed the benefits of this method relative to its improvement of classification AUC-score and its scalability.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129799239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
[Copyright notice]
{"title":"[Copyright notice]","authors":"","doi":"10.1109/hipc50609.2020.00003","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00003","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122748086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Model Checking as a Service using Dynamic Resource Scaling
Surya Teja Palavalasa, Yuvraj Singh, Adhish Singla, Suresh Purini, Venkatesh Choppella
Model checking is now a standard technology for verifying large and complex systems. While there is a range of tools and techniques for verifying various properties of a system under consideration, in this work we restrict our attention to safety-checking procedures based on explicit state-space generation. The hardware resources this approach requires depend on the model's complexity and on the state transition graph it generates, and cannot be estimated a priori. For reasonably realistic models, the main memory available even on high-end servers may not be sufficient, so distributed safety verification across a cluster of nodes becomes necessary. However, the problem of estimating the minimum number of cluster nodes needed for the verification procedure to complete successfully remains unsolved. In this paper, we propose a dynamically scalable model checker built on an actor-based architecture. With the proposed approach, an end user can invoke a model checker hosted on a cloud platform in a push-button fashion. Our safety verification procedure automatically expands the cluster by requesting more virtual machines from the cloud provider, and the user pays only for the hardware resources rented for the duration of the verification. We refer to this as Model Checking as a Service. We approach the problem with an asynchronous safety-checking algorithm in the actor framework, which allows resources to be scaled on demand and redistributes the workload transparently through state migration. We tested our approach by developing a distributed version of the SpinJA model checker on the Akka actor framework, and we conducted experiments on the Google Compute Engine (GCE) platform, scaling resources automatically through the GCE API. On large models such as anderson.8 from the BEEM benchmark suite, our approach reduced the dollar cost of model checking by 8.6x while cutting the wall-clock time of the safety-checking procedure by 5.5x.
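For intuition, here is a minimal sketch of explicit-state safety checking: breadth-first exploration of the reachable state space by a pool of workers, stopping when a state violates the safety predicate. The toy transition system, worker count, and error predicate are invented for illustration; the paper's actor-based, elastically scaled design is far more involved.

```python
# Minimal sketch: parallel breadth-first search for an unsafe state.
from concurrent.futures import ThreadPoolExecutor

def successors(state):
    """Toy model: two bounded counters; each step increments one of them."""
    x, y = state
    return [(x + 1, y), (x, y + 1)] if x + y < 20 else []

def is_error(state):
    return state == (3, 4)            # the "unsafe" state being hunted

def check_safety(initial, workers=4):
    seen, frontier = {initial}, [initial]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            if any(pool.map(is_error, frontier)):
                return False           # a counterexample is reachable
            next_frontier = []
            for succs in pool.map(successors, frontier):   # expand in parallel
                for s in succs:
                    if s not in seen:
                        seen.add(s)
                        next_frontier.append(s)
            frontier = next_frontier
    return True                        # fixed point reached, no violation

print(check_safety((0, 0)))   # False: (3, 4) is reachable from (0, 0)
```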
{"title":"Model Checking as a Service using Dynamic Resource Scaling","authors":"Surya Teja Palavalasa, Yuvraj Singh, Adhish Singla, Suresh Purini, Venkatesh Choppella","doi":"10.1109/HiPC50609.2020.00027","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00027","url":null,"abstract":"Model checking is now a standard technology for verifying large and complex systems. While there are a range of tools and techniques to verify various properties of a system under consideration, in this work, we restrict our attention to safety checking procedures using explicit state space generation. The necessary hardware resources required in this approach depends on the model complexity and the resulting state transition graph that gets generated. This cannot be estimated apriori. For reasonably realistic models, the available main memory in even high end servers may not be sufficient. Hence, we have to use distributed safety verification approaches on a cluster of nodes. However, the problem of estimating the minimum number of nodes in the cluster for the verification procedure to complete successfully remains unsolved. In this paper, we propose a dynamically scalable model checker using an actor based architecture. Using the proposed approach, an end user can invoke a model checker hosted on a cloud platform in a push button fashion. Our safety verification procedures automatically expands the cluster by requesting more virtual machines from the cloud provider. Finally, the user gets to pay only for the hardware resources he rented for the duration of the verification procedure. We refer to this as Model Checking as Service. We approach this problem by proposing an asynchronous algorithm for safety checking in actor framework. The actor based approach allows for scaling the resources on a need basis and redistributes the work load transparently through state migration. We tested our approach by developing a distributed version of SpinJA model checker using Akka actor framework. We conducted our experiments on Google Cloud Engine (GCE) platform wherein we scale our resources automatically using the GCE API. On large models such as anderson.8 from BEEM benchmark suite, our approach reduced the model checking cost in dollars by 8.6x while reducing the wall clock time to complete the safety checking procedure 5.5x times.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132199701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SparsePipe: Parallel Deep Learning for 3D Point Clouds
Keke Zhai, Pan He, Tania Banerjee-Mishra, A. Rangarajan, S. Ranka
We propose SparsePipe, an efficient and asynchronous parallelism approach for handling 3D point clouds with multi-GPU training. SparsePipe is built to support 3D sparse data such as point clouds. It achieves this by adopting generalized convolutions with a sparse tensor representation to build expressive high-dimensional convolutional neural networks. Compared to dense solutions, the new models can efficiently process irregular point clouds without densely sliding over the entire space, significantly reducing memory requirements and allowing higher resolutions of the underlying 3D volumes for better performance. SparsePipe exploits intra-batch parallelism, which partitions input data across multiple processors, and further improves training throughput with inter-batch pipelining that overlaps communication and computation. In addition, when the GPUs are heterogeneous, it partitions the model so that computation is load-balanced with reduced communication overhead. Using experimental results on an eight-GPU platform, we show that SparsePipe parallelizes effectively and obtains better performance than dense solutions on current point cloud benchmarks, for both training and inference.
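The inter-batch pipelining idea can be sketched with plain threads: a batch is cut into micro-batches, and while one micro-batch occupies the compute stage, the next is already in the transfer stage, so communication overlaps computation. The two stages and their sleep-based costs below are simulated stand-ins, not SparsePipe's actual GPU pipeline.

```python
# Minimal sketch: a two-stage pipeline over micro-batches (transfer -> compute).
import queue
import threading
import time

def stage(name, inbox, outbox, cost):
    while True:
        item = inbox.get()
        if item is None:                  # poison pill: shut the stage down
            if outbox is not None:
                outbox.put(None)
            return
        time.sleep(cost)                  # stand-in for a copy or a kernel
        print(f"{name} finished micro-batch {item}")
        if outbox is not None:
            outbox.put(item)

transfer_q, compute_q = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=("transfer", transfer_q, compute_q, 0.01)),
    threading.Thread(target=stage, args=("compute ", compute_q, None, 0.02)),
]
for t in threads:
    t.start()
for micro_batch in range(8):              # one batch split into 8 micro-batches
    transfer_q.put(micro_batch)           # transfer of batch i+1 overlaps compute of i
transfer_q.put(None)
for t in threads:
    t.join()
```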
{"title":"SparsePipe: Parallel Deep Learning for 3D Point Clouds","authors":"Keke Zhai, Pan He, Tania Banerjee-Mishra, A. Rangarajan, S. Ranka","doi":"10.1109/HiPC50609.2020.00019","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00019","url":null,"abstract":"We propose SparsePipe, an efficient and asynchronous parallelism approach for handling 3D point clouds with multi-GPU training. SparsePipe is built to support 3D sparse data such as point clouds. It achieves this by adopting generalized convolutions with sparse tensor representation to build expressive high-dimensional convolutional neural networks. Compared to dense solutions, the new models can efficiently process irregular point clouds without densely sliding over the entire space, significantly reducing the memory requirements and allowing higher resolutions of the underlying 3D volumes for better performance. SparsePipe exploits intra-batch parallelism that partitions input data into multiple processors and further improves the training throughput with inter-batch pipelining to overlap communication and computing. Besides, it suitably partitions the model when the GPUs are heterogeneous such that the computing is load-balanced with reduced communication overhead. Using experimental results on an eight-GPU platform, we show that SparsePipe can parallelize effectively and obtain better performance on current point cloud benchmarks for both training and inference, compared to its dense solutions.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134333135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Message from the Program Chairs
B. Uçar, G. Agrawal
{"title":"Message from the Program Chairs","authors":"B. Uçar, G. Agrawal","doi":"10.1109/hipc50609.2020.00006","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00006","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121802629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring Task Parallelism for the Multilevel Fast Multipole Algorithm
Michael P. Lingg, S. Hughey, Doga Dikbayir, B. Shanker, H. Aktulga
The Multi-Level Fast Multipole Algorithm (MLFMA), a variant of the fast multipole method (FMM) for problems with oscillatory potentials, significantly accelerates the solution of problems based on wave physics, such as those in electromagnetics and acoustics. Existing shared-memory parallel approaches for MLFMA have adopted the bulk synchronous parallel (BSP) model. While the BSP approach has served well so far, it is prone to significant thread synchronization overheads and, more importantly, fails to leverage communication/computation overlap opportunities due to the complicated data dependencies in MLFMA. In this paper, we develop a task-parallel MLFMA implementation for shared-memory architectures and discuss optimizations to improve its performance. We then evaluate the new task-parallel MLFMA implementation against a BSP implementation on a number of geometries. Our findings suggest that task parallelism is generally superior to the BSP model, and considering its potential advantages over BSP in a hybrid parallel setting, we see it as a promising approach for addressing the scalability issues of MLFMA in large-scale computations.
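To see the contrast the abstract draws, here is a minimal sketch of an upward tree pass (the shape of the FMM aggregation phase) executed both ways: the BSP variant places a barrier after every tree level, while the task variant lets each parent run as soon as its own children finish. The tiny tree and per-node "work" are toy stand-ins, not the paper's MLFMA kernels.

```python
# Minimal sketch: BSP (level barriers) vs. task parallelism on an upward pass.
from concurrent.futures import ThreadPoolExecutor

tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}

def work(node, child_results):
    return node + sum(child_results)      # stand-in for a multipole translation

def bsp_upward(pool):
    results = {}
    for level in [[3, 4, 5, 6], [1, 2], [0]]:            # leaves first
        done = pool.map(lambda n: work(n, [results[c] for c in tree[n]]), level)
        results.update(zip(level, done))                  # implicit level barrier
    return results[0]

def task_upward(pool, node=0):
    futures = [pool.submit(task_upward, pool, c) for c in tree[node]]
    return work(node, [f.result() for f in futures])      # waits only on own children

# Four workers suffice for this tiny tree; a real task runtime would avoid
# blocking workers on child results altogether.
with ThreadPoolExecutor(max_workers=4) as pool:
    print(bsp_upward(pool), task_upward(pool))            # both print 21
```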
{"title":"Exploring Task Parallelism for the Multilevel Fast Multipole Algorithm","authors":"Michael P. Lingg, S. Hughey, Doga Dikbayir, B. Shanker, H. Aktulga","doi":"10.1109/HiPC50609.2020.00018","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00018","url":null,"abstract":"The Multi-Level Fast Multipole Algorithm (MLFMA), a variant of the fast multiple method (FMM) for problems with oscillatory potentials, significantly accelerates the solution of problems based on wave physics, such as those in electromagnetics and acoustics. Existing shared memory parallel approaches for MLFMA have adopted the bulk synchronous parallel (BSP) model. While the BSP approach has served well so far, it is prone to significant thread synchronization overheads, but more importantly fails to leverage the communication/computation overlap opportunities due to complicated data dependencies in MLFMA. In this paper, we develop a task parallel MLFMA implementation for shared memory architectures, and discuss optimizations to improve its performance. We then evaluate the new task parallel MLFMA implementation against a BSP implementation for a number of geometries. Our findings suggest that task parallelism is generally superior to the BSP model, and considering its potential advantages over the BSP model in a hybrid parallel setting, we see it to be a promising approach in addressing the scalability issues of MLFMA in large scale computations.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132181642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
PufferFish: NUMA-Aware Work-stealing Library using Elastic Tasks
Vivek Kumar
Due to the challenges in providing adequate memory access to many cores on a single processor, multi-die and multi-socket multicore systems are becoming mainstream. These systems offer cache-coherent Non-Uniform Memory Access (NUMA) across several memory banks and cache hierarchies to increase memory capacity and bandwidth. Random work-stealing is a widely used technique for dynamic load balancing of tasks on multicore processors. However, it scales poorly on such NUMA systems for memory-bound applications, due to cache misses and remote memory access latency. The Hierarchical Place Tree (HPT) [1] is a popular approach for improving the locality of a task-based parallel programming model, although it requires the programmer to map the dynamically unfolding tasks evenly over the NUMA system. Specifying data-affinity hints provides a more natural way to map tasks than the HPT, yet a scalable work-stealing implementation based on such hints remains largely unexplored for modern NUMA systems. This paper presents PufferFish, a new async-finish parallel programming model and work-stealing runtime for NUMA systems that closely couples the data-affinity hints provided for an asynchronous task with the HPTs in the Habanero C/C++ library (HClib). PufferFish introduces Hierarchical Elastic Tasks (HETs), which improve locality by shrinking to run on a single worker inside a place or puffing up across multiple workers, depending on the work imbalance at a particular place in the HPT. We use a set of widely used memory-bound benchmarks exhibiting regular and irregular execution graphs to evaluate PufferFish. On these benchmarks, we show that PufferFish achieves geometric mean speedups of 1.5× and 1.9× over the HPT implementation in HClib and random work-stealing in CilkPlus, respectively, on a 32-core NUMA AMD EPYC processor.
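For reference, below is a minimal sketch of the random work-stealing baseline that PufferFish builds on: each worker pops tasks from the bottom of its own deque and, when that is empty, steals from the top of a random victim's deque. It illustrates only the scheduling policy (Python's GIL precludes real speedup); the locality-aware places and elastic tasks of PufferFish are not modelled here.

```python
# Minimal sketch: per-worker deques with random victim stealing.
import collections
import random
import threading

WORKERS = 4
deques = [collections.deque() for _ in range(WORKERS)]
locks = [threading.Lock() for _ in range(WORKERS)]
done = [0] * WORKERS

def worker(me):
    rng = random.Random(me)
    idle_spins = 0
    while idle_spins < 1000:                      # crude termination heuristic
        task = None
        with locks[me]:
            if deques[me]:
                task = deques[me].pop()           # LIFO from own bottom: locality
        if task is None:
            victim = rng.randrange(WORKERS)       # random victim selection
            with locks[victim]:
                if deques[victim]:
                    task = deques[victim].popleft()   # FIFO steal from the top
        if task is None:
            idle_spins += 1
            continue
        idle_spins = 0
        task()
        done[me] += 1

for _ in range(10_000):                           # deliberately unbalanced load
    deques[0].append(lambda: sum(range(100)))
threads = [threading.Thread(target=worker, args=(w,)) for w in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(done)                                       # the load spreads across workers
```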
{"title":"PufferFish: NUMA-Aware Work-stealing Library using Elastic Tasks","authors":"Vivek Kumar","doi":"10.1109/HiPC50609.2020.00039","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00039","url":null,"abstract":"Due to the challenges in providing adequate memory access to many cores on a single processor, Multi-Die and Multi-Socket based multicore systems are becoming mainstream. These systems offer cache-coherent Non-Uniform Memory Access (NUMA) across several memory banks and cache hierarchy to increase memory capacity and bandwidth. Random work-stealing is a widely used technique for dynamic load balancing of tasks on multicore processors. However, it scales poorly on such NUMA systems for memory-bound applications due to cache misses and remote memory access latency. Hierarchical Place Tree (HPT) [1] is a popular approach for improving the locality of a task-based parallel programming model, albeit it requires the programmer to map the dynamically unfolding tasks over a NUMA system evenly. Specifying data-affinity hints provides a more natural way to map the tasks than HPT. Still, a scalable work-stealing implementation for the same is mostly unexplored for modern NUMA systems. This paper presents PufferFish, a new async-finish parallel programming model and work-stealing runtime for NUMA systems that provide a close coupling of the data-affinity hints provided for an asynchronous task with the HPTs in Habanero C/C++ library (HClib). PufferFish introduces Hierarchical Elastic Tasks (HET) that improves the locality by shrinking itself to run on a single worker inside a place or puffing up across multiple workers depending on the work imbalance at a particular place in an HPT. We use a set of widely used memory-bound benchmarks exhibiting regular and irregular execution graphs for evaluating PufferFish. On these benchmarks, we show that PufferFish achieves a geometric mean speedup of 1.5× and 1.9× over HPT implementation in HClib and random work-stealing in CilkPlus, respectively, on a 32-core NUMA AMD EPYC processor.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114356415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Algorithms for Preemptive Co-scheduling of Kernels on GPUs
Lionel Eyraud-Dubois, C. Bentes
Modern GPUs allow concurrent kernel execution and preemption to improve hardware utilization and responsiveness. Currently, the decision on the simultaneous execution of kernels is made by the hardware, which can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs in high-competition scenarios. We propose a novel graph-based preemptive co-scheduling algorithm, with a focus on reducing the number of preemptions. We show that the optimal preemptive makespan can be computed by solving a linear program in polynomial time. Based on this solution, we propose a graph-theoretical model and an algorithm for building preemptive schedules that minimize the number of preemptions. We show, however, that finding the minimum number of preemptions among all preemptive solutions of optimal makespan is an NP-hard problem. We performed experiments on real-world GPU applications, and our approach achieves the optimal makespan while preempting only 6 to 9% of the tasks.
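The paper computes the optimal preemptive makespan with a linear program. For the classical special case of independent jobs on m identical processors, that optimum has the closed form max(sum(p)/m, max(p)), and McNaughton's wrap-around rule attains it with at most m-1 preemptions. The sketch below implements that classical construction, not the paper's graph-based algorithm.

```python
# Minimal sketch: McNaughton's wrap-around rule for preemptive scheduling.
def mcnaughton(durations, m):
    opt = max(sum(durations) / m, max(durations))    # optimal preemptive makespan
    schedule = [[] for _ in range(m)]                # per-processor (job, start, end)
    proc, t = 0, 0.0
    for job, p in enumerate(durations):
        remaining = p
        while remaining > 1e-12:
            piece = min(remaining, opt - t)          # fill the current processor
            schedule[proc].append((job, t, t + piece))
            remaining -= piece
            t += piece
            if opt - t <= 1e-12:                     # processor full: wrap around;
                proc, t = proc + 1, 0.0              # a split here is a preemption
    return opt, schedule

opt, sched = mcnaughton([5, 3, 3, 2, 4, 1], m=3)
print(opt)                                           # 6.0
for p, slots in enumerate(sched):
    print(f"processor {p}: {slots}")
```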
{"title":"Algorithms for Preemptive Co-scheduling of Kernels on GPUs","authors":"Lionel Eyraud-Dubois, C. Bentes","doi":"10.1109/HiPC50609.2020.00033","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00033","url":null,"abstract":"Modern GPUs allow concurrent kernel execution and preemption to improve hardware utilization and responsiveness. Currently, the decision on the simultaneous execution of kernels is performed by the hardware, which can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs in high competition scenarios. We propose a novel graph-based preemptive co-scheduling algorithm, with the focus on reducing the number of preemptions. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time. Based on this solution we propose graph theoretical model and an algorithm to build preemptive schedules which minimize the number of preemptions. We show, however, that finding the minimum amount of preemptions among all preemptive solutions of optimal makespan is a NP-hard problem. We performed experiments on real-world GPU applications and our approach can achieve optimal makespan by preempting 6 to 9% of the tasks.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"177 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114610553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Performance Optimization and Scalability Analysis of the MGB Hydrological Model
H. Freitas, C. Mendes, A. Ilic
Hydrological models are extensively used in applications such as water resources, climate change, land use, and forecast systems. The focus of this paper is performance optimization of the MGB hydrological model, which is widely employed to simulate water flows in large-scale watersheds. The optimization strategies we selected include AVX-512 vectorization, thread parallelism on multi-core CPUs (OpenMP), and data parallelism on many-core GPUs (CUDA). We conducted experiments on real-world input datasets on state-of-the-art HPC systems based on Intel Skylake CPUs and NVIDIA GPUs. In addition, a Roofline model characterization for these datasets confirmed performance improvements of up to 37.5x on the most time-consuming part of the code and 8.6x on the full MGB model. The work proposed herein shows that careful optimization is needed for hydrological models to achieve a significant fraction of the performance potential of modern processors.
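The Roofline bound used in the paper's characterization is simple to state: attainable performance is the minimum of the machine's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. The peak and bandwidth figures below are hypothetical placeholders, not the paper's measured Skylake or GPU numbers.

```python
# Minimal sketch: the Roofline performance bound.
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s for a kernel with the given FLOP/byte ratio."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

PEAK, BW = 3000.0, 120.0        # hypothetical: 3 TFLOP/s peak, 120 GB/s DRAM
for ai in (0.1, 1.0, 25.0, 100.0):
    bound = roofline(PEAK, BW, ai)
    regime = "memory-bound" if bound < PEAK else "compute-bound"
    print(f"AI = {ai:6.1f} FLOP/byte -> {bound:7.1f} GFLOP/s ({regime})")
```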
{"title":"Performance Optimization and Scalability Analysis of the MGB Hydrological Model","authors":"H. Freitas, C. Mendes, A. Ilic","doi":"10.1109/HiPC50609.2020.00017","DOIUrl":"https://doi.org/10.1109/HiPC50609.2020.00017","url":null,"abstract":"Hydrological models are extensively used in applications such as water resources, climate change, land use, and forecast systems. The focus of this paper is performance optimization of the MGB hydrological model, which is widely employed to simulate water flows in large-scale watersheds. The optimization strategies that we selected include AVX-512 vectorization, thread-parallelism on multi-core CPUs (OpenMP), and data-parallelism on many-core GPUs (CUDA). We conducted experiments for real-world input datasets on state-of-the-art HPC systems based on Intel's Skylake CPUs and NVIDIA GPUs. In addition, a Roofline model characterization for these datasets confirmed performance improvements of up to 37.5x on the most time-consuming part of the code and 8.6x on the full MGB model. The work proposed herein shows that careful optimizations are needed for hydrological models to achieve a significant fraction of the performance potential in modern processors.","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127406267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1