
Workshop Proceedings of the 51st International Conference on Parallel Processing: Latest Publications

Register-Pressure Aware Predicator for Length Multiplier of RVV
Meng-Shiuan Shih, H.M. Lai, Chao-Lin Lee, Chung-Kai Chen, Jenq-Kuen Lee
Parallel processing with vector processors is indispensable. The RISC-V vector extension (RVV) is a highly anticipated extension due to the growing demand from AI applications, and RISC-V's modularity and extensibility have made it a popular instruction set in industry. Compared to SIMD instructions, vector instructions need fewer instructions and a larger register size, handling multiple registers within one instruction for higher performance. With the vector grouping mechanism provided by RVV, the vector length multiplier (LMUL), RVV can combine multiple vector registers into one group so that the processor can increase data-processing throughput at the same issue rate. However, due to register pressure, a larger vector length does not always correlate positively with performance. Therefore, in this paper, we develop an LMUL predicator with register-pressure-aware models to accurately assign the proper LMUL for different programs. The algorithm builds on a priority-based register allocation algorithm and considers the cost of register pressure and program use patterns, helping assign the proper vector length multiplier at compile time for RVV. Experimental results show that, across a total of 76 vectorization cases from TSVC, the proposed register-pressure-aware length multiplier predicator correctly predicts the optimal LMUL value in 73 of them.
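To make the LMUL trade-off concrete, here is a minimal C-style sketch using the standard RVV intrinsics (names per the RVV intrinsics v1.0 naming convention) to add two integer arrays with LMUL = 4. It only illustrates the register-grouping mechanism the predicator reasons about, not the paper's prediction algorithm itself.

```cpp
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Vector add with LMUL = 4: each instruction operates on a group of four
// architectural vector registers, quadrupling the elements processed per
// instruction at the cost of quadrupled register pressure (only 8 register
// groups remain available, which is what the paper's models account for).
void vadd_m4(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m4(n);          // elements this strip
        vint32m4_t va = __riscv_vle32_v_i32m4(a, vl); // grouped vector load
        vint32m4_t vb = __riscv_vle32_v_i32m4(b, vl);
        vint32m4_t vc = __riscv_vadd_vv_i32m4(va, vb, vl);
        __riscv_vse32_v_i32m4(c, vc, vl);             // grouped vector store
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```

Changing `m4` to `m1` or `m8` in the intrinsic names changes the grouping; picking the right one per program is exactly the decision the predicator automates.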
{"title":"Register-Pressure Aware Predicator for Length Multiplier of RVV","authors":"Meng-Shiuan Shih, H.M. Lai, Chao-Lin Lee, Chung-Kai Chen, Jenq-Kuen Lee","doi":"10.1145/3547276.3548513","DOIUrl":"https://doi.org/10.1145/3547276.3548513","url":null,"abstract":"The use of parallel processing with vector processors is indispensable. The RISC-V vector extension (RVV) is a highly anticipated extension due to the demand for growing AI applications. The modularity and extensibility make RISC-V a popular instruction set in the industry. Compared to SIMD instruction, vector instructions use fewer instructions with a larger register size which can handle multiple registers within one instruction, resulting in higher performance. With the vector grouping mechanism called vector length multiplier (LMUL) provided by RVV, RVV can combine multiple vector registers into one group so that the processor can increase the throughput of processing data under the same issue rate. However, due to the register pressure, the vector length is not always positively relative to the performance. Therefore, in this paper, we develop an LMUL predicator with register-pressure-aware models to accurately assign the proper LMUL for different programs. The algorithm is based on a priority-based register allocation algorithm and considers the cost of the register pressures and program use patterns. This design helps assign the proper vector length multiplier in compile time for RVV. The experiment result shows that, with a total of 76 vectorization cases of TSVC, the proposed register pressure aware length multiplier achieves 73 correct predictions of the optimal value of Length Multiplier.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130531749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DenMG: Density-Based Member Generation for Ensemble Clustering
Xueqin Du, Yulin He, Philippe Fournier-Viger, J. Huang
Ensemble clustering is a popular approach for identifying clusters in data, which combines results from multiple clustering algorithms to obtain more accurate and robust clusters. However, the performance of ensemble clustering algorithms greatly depends on the quality of their members. Based on this observation, this paper proposes a density-based member generation (DenMG) algorithm that selects ensemble members by considering distribution consistency. DenMG has two main components, which split sample points from a heterocluster and merge sample points to form a homocluster, respectively. The first component estimates two probability density functions (p.d.f.s) from a heterocluster's sample points, representing them with a Gaussian distribution and a Gaussian mixture model; if random numbers generated by these two p.d.f.s are deemed to follow different probability distributions, the heterocluster is split into smaller clusters. The second component merges clusters that have high neighborhood densities into a homocluster, using an opposite-oriented criterion that measures neighborhood density. A series of experiments demonstrates the feasibility and effectiveness of the proposed ensemble member generation algorithm: it generates high-quality ensemble members and, as a result, yields better clustering than five state-of-the-art ensemble clustering algorithms.
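The abstract does not spell out the opposite-oriented criterion, so the C++ sketch below is only a loose, hypothetical stand-in for the merge component: two clusters are fused into a homocluster when both sides look sufficiently dense toward their union under a plain radius-based neighborhood count. The radius `r`, neighbor count `k`, and `threshold` are illustrative parameters, not values from the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Fraction of points in `c` whose r-neighborhood within `all` contains at
// least k other points -- a crude neighborhood-density score standing in for
// the paper's opposite-oriented criterion.
double densityScore(const std::vector<Point>& c,
                    const std::vector<Point>& all,
                    double r, std::size_t k) {
    std::size_t dense = 0;
    for (const Point& p : c) {
        std::size_t neighbors = 0;
        for (const Point& q : all)
            if (std::hypot(p.x - q.x, p.y - q.y) <= r) ++neighbors;
        if (neighbors - 1 >= k) ++dense;   // minus one: exclude p itself
    }
    return c.empty() ? 0.0 : static_cast<double>(dense) / c.size();
}

// Merge step: fuse two candidate clusters only when both sides look dense
// toward their union; on success `a` becomes the homocluster and `b` empties.
bool tryMerge(std::vector<Point>& a, std::vector<Point>& b,
              double r, std::size_t k, double threshold) {
    std::vector<Point> uni = a;
    uni.insert(uni.end(), b.begin(), b.end());
    if (densityScore(a, uni, r, k) >= threshold &&
        densityScore(b, uni, r, k) >= threshold) {
        a = std::move(uni);
        b.clear();
        return true;
    }
    return false;
}
```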
{"title":"DenMG: Density-Based Member Generation for Ensemble Clustering","authors":"Xueqin Du, Yulin He, Philippe Fournier-Viger, J. Huang","doi":"10.1145/3547276.3548520","DOIUrl":"https://doi.org/10.1145/3547276.3548520","url":null,"abstract":"Ensemble clustering is a popular approach for identifying clusters in data, which combines results from multiple clustering algorithms to obtain more accurate and robust clusters. However, the performance of ensemble clustering algorithms greatly depends on the quality of its members. Based on this observation, this paper proposes a density-based member generation (DenMG) algorithm that selects ensemble members by considering the distribution consistency. DenMG has two main components, which split sample points from a heterocluster and merge sample points to form a homocluster, respectively. The first component estimates two probability density functions (p.d.f.s) based on an heterocluster’s sample points, and represents them using a Gaussian distribution and a Gaussian mixture model. If random numbers generated by these two p.d.f.s are deemed to have different probability distributions, the heterocluster is split into smaller clusters. The second component merges clusters that have high neighborhood densities into a homocluster. This is done using an opposite-oriented criterion that measures neighborhood density. A series of experiments were conducted to demonstrate the feasibility and effectiveness of the proposed ensemble member generation algorithm. Results show that the proposed algorithm can generate high quality ensemble members and as a result yield better clustering than five state-of-the-art ensemble clustering algorithms.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"238 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133683100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Application Showcases for TVM with NeuroPilot on Mobile Devices
Sheng-Yuan Cheng, Chun-Ping Chung, Robert Lai, Jenq-Kuen Lee
With the increasing demand for machine learning inference on mobile devices, more platforms are emerging to provide AI inference there. A popular one is TVM, an end-to-end AI compiler; its major drawback is that it does not support every manufacturer-supplied accelerator. On the other hand, NeuroPilot, the AI solution for MediaTek's platform, offers high-performance inference on mobile devices but does not support all of the common machine learning frameworks. We therefore combine the advantages of both: the resulting solution accepts a variety of machine learning frameworks, including TensorFlow, PyTorch, ONNX, and MXNet, while utilizing MediaTek's AI accelerator. We adopt the TVM BYOC (Bring Your Own Codegen) flow to implement the solution. To illustrate the ability to accept different machine learning frameworks for different tasks, we build an application showcase from three models: a face anti-spoofing model from PyTorch, an emotion detection model from Keras, and an object detection model from Tflite. Since these models have dependencies when running inference, we propose a prototype pipeline algorithm to improve the showcase's inference performance.
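As a rough illustration of the pipelining idea (overlapping dependent model stages across successive frames), here is a minimal C++ sketch with three threads connected by blocking queues. The stage bodies are placeholders, not actual TVM or NeuroPilot calls.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Minimal blocking queue used as the hand-off between pipeline stages.
template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return !q_.empty(); });
        T v = std::move(q_.front()); q_.pop();
        return v;
    }
};

int main() {
    Channel<int> faces, emotions;      // hand-offs between model stages
    const int kFrames = 8, kDone = -1; // kDone is a sentinel frame id

    // Stage 1: stand-in for the Tflite object/face detection model.
    std::thread detect([&] {
        for (int f = 0; f < kFrames; ++f) faces.push(f);
        faces.push(kDone);
    });
    // Stage 2: stand-in for the PyTorch face anti-spoofing model.
    std::thread spoof([&] {
        for (int f; (f = faces.pop()) != kDone; ) emotions.push(f);
        emotions.push(kDone);
    });
    // Stage 3: stand-in for the Keras emotion detection model.
    std::thread emotion([&] {
        for (int f; (f = emotions.pop()) != kDone; )
            std::printf("frame %d fully processed\n", f);
    });

    detect.join(); spoof.join(); emotion.join();
}
```

While frame N sits in the emotion stage, frames N+1 and N+2 can already occupy the earlier stages, which is where the throughput gain over strictly sequential inference comes from.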
{"title":"Application Showcases for TVM with NeuroPilot on Mobile Devices","authors":"Sheng-Yuan Cheng, Chun-Ping Chung, Robert Lai, Jenq-Kuen Lee","doi":"10.1145/3547276.3548514","DOIUrl":"https://doi.org/10.1145/3547276.3548514","url":null,"abstract":"With the increasing demand for machine learning inference on mobile devices, more platforms are emerging to provide AI inferences on mobile devices. One of the popular ones is TVM, which is an end-to-end AI compiler. The major drawback is TVM doesn’t support all manufacturer-supplied accelerators. On the other hand, an AI solution for MediaTek’s platform, NeuroPilot, offers inference on mobile devices with high performance. Nevertheless, NeuroPilot does not support all of the common machine learning frameworks. Therefore, we want to take advantage of both sides. This way, the solution could accept a variety of machine learning frameworks, including Tensorflow, Pytorch, ONNX, and MxNet and utilize the AI accelerator from MediaTek. We adopt the TVM BYOC flow to implement the solution. In order to illustrate the ability to accept different machine learning frameworks for different tasks, we used three different models to build an application showcase in this work: the face anti-spoofing model from PyTorch, the emotion detection model from Keras, and the object detection model from Tflite. Since these models have dependencies while running inference, we propose a prototype of pipeline algorithm to improve the inference performance of the application showcase.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133645450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling
T. Boku, N. Fujita, Ryohei Kobayashi, O. Tatebe
In this paper, we describe the concept, system architecture, supporting system software, and applications of our world-first supercomputer with multihybrid accelerators coupling GPUs and FPGAs, named Cygnus, which runs at the Center for Computational Sciences, University of Tsukuba. Cygnus is constructed with over 80 computation nodes as a GPU-accelerated PC cluster, and a special group of 32 nodes, named the Albireo part, is configured as a multihybrid accelerated computing system. Each node of the Albireo part is equipped with four NVIDIA V100 GPU cards and two Intel Stratix10 FPGA cards in addition to two sockets of Intel Xeon Gold CPUs, and all nodes are connected by four lanes of InfiniBand HDR100 interconnection HCAs at the full bisection bandwidth of NVIDIA HDR200 switches. Besides this ordinary interconnection network, all FPGA cards in the Albireo part are connected by a special two-dimensional torus network with direct optical links on each FPGA, constructing a very high-throughput, low-latency FPGA-centric interconnection network. To the best of our knowledge, Cygnus is the world's first production-level PC cluster to realize multihybrid acceleration with a GPU and FPGA combination. Unlike other GPU-accelerated clusters, users can program parallel codes where each process exploits the GPU, the FPGA, or both, based on the characteristics of their applications. Because the programming method for such a complicated system has not been standardized, we developed various supporting system software such as an inter-FPGA network routing system, a DMA engine for FPGA-managed GPU-FPGA direct communication, and a multihybrid accelerated programming framework. Further, we developed the first real application on Cygnus, a fundamental astrophysics simulation, to fully utilize the GPU and FPGA together for very efficient acceleration. We describe the overall concept and construction of the Cygnus cluster with a brief introduction of the several underlying hardware and software research studies that have already been published, and summarize how this concept of GPU/FPGA coworking will usher in a new era of accelerated supercomputing.
{"title":"Cygnus - World First Multihybrid Accelerated Cluster with GPU and FPGA Coupling","authors":"T. Boku, N. Fujita, Ryohei Kobayashi, O. Tatebe","doi":"10.1145/3547276.3548629","DOIUrl":"https://doi.org/10.1145/3547276.3548629","url":null,"abstract":"In this paper, we describe the concept, system architecture, supporting system software, and applications on our world-first supercomputer with multihybrid accelerators using GPU and FPGA coupling, named Cygnus, which runs at Center for Computational Sciences, University of Tsukuba. A special group of 32 nodes is configured as a multihybrid accelerated computing system named Albireo part although Cygnus is constructed with over 80 computation nodes as a GPU-accelerated PC cluster. Each node of the Albireo part is equipped with four NVIDIA V100 GPU cards and two Intel Stratix10 FPGA cards in addition to two sockets of Intel Xeon Gold CPU where all nodes are connected by four lanes of InfiniBand HDR100 interconnection HCA in the full bisection bandwidth of NVIDIA HDR200 switches. Beside this ordinary interconnection network, all FPGA cards in Albireo part are connected by a special 2-Dimensional Torus network with direct optical links on each FPGA for constructing a very high throughput and low latency of FPGA-centric interconnection network. To the best of our knowledge, Cygnus is the world’s first production-level PC cluster to realize multihybrid acceleration with the GPU and FPGA combination. Unlike other GPU-accelerated clusters, users can program parallel codes where each process exploits both or either of the GPU and/or FPGA devices based on the characteristics of their applications. We developed various supporting system software such as inter-FPGA network routing system, DMA engine for GPU-FPGA direct communication managed by FPGA, and multihybrid accelerated programming framework because the programming method of such a complicated system has not been standardized. Further, we developed the first real application on Cygnus for fundamental astrophysics simulation to fully utilize GPU and FPGA together for very efficient acceleration. We describe the overall concept and construction of the Cygnus cluster with a brief introduction of the several underlying hardware and software research studies that have already been published. We summarize how such a concept of GPU/FPGA coworking will usher in a new era of accelerated supercomputing.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133463304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU
Zheming Jin, J. Vetter
Integer sum reduction is a primitive operation commonly used in scientific computing. Implementing a parallel reduction on a GPU often involves concurrent memory accesses using atomic operations and synchronization of work-items in a work-group. For a better understanding of these operations, we redesigned micro-kernels in the HIP programming language to measure the time of atomic operations over global memory, the cost of barrier synchronization, and reduction within a work-group to shared local memory using one atomic addition per work-item on a compute unit of an AMD MI100 GPU. We then describe implementations of the reduction kernels with vectorized memory accesses, parameterized workload sizes, and the vendor's library APIs. Our experimental results show that: 1) there is a performance tradeoff between the cost of barrier synchronization and the amount of parallelism from atomic operations over shared local memory as the size of a work-group increases; 2) a reduction kernel with vectorized memory accesses and vector data types is approximately 3% faster for the large problem size than the kernels written with the vendor's library APIs; 3) the compiler needs to assist the hardware processor with data dependency resolution at the level of the instruction set architecture; and 4) the power consumption of kernel execution on the GPU fluctuates between 277 W and 301 W, while the dynamic power of other GPU activities is at most 31 W.
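The measured pattern can be summarized as: one atomic addition per work-item into shared local memory, a barrier, then a single atomic per work-group into global memory. The HIP sketch below is an assumed minimal form of such a micro-kernel, not the authors' exact code.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void reduce(const int* in, int* out, int n) {
    __shared__ int local;                        // work-group partial sum
    if (threadIdx.x == 0) local = 0;
    __syncthreads();                             // barrier under test
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&local, in[i]);         // one atomic per work-item
    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(out, local); // one global atomic per group
}

int main() {
    const int n = 1 << 20, threads = 256;
    std::vector<int> h(n, 1);                    // all ones: sum should be n
    int *din, *dout, sum = 0;
    hipMalloc(&din, n * sizeof(int));
    hipMalloc(&dout, sizeof(int));
    hipMemcpy(din, h.data(), n * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dout, &sum, sizeof(int), hipMemcpyHostToDevice);
    hipLaunchKernelGGL(reduce, dim3(n / threads), dim3(threads), 0, 0,
                       din, dout, n);
    hipMemcpy(&sum, dout, sizeof(int), hipMemcpyDeviceToHost);
    std::printf("sum = %d (expected %d)\n", sum, n);
    hipFree(din); hipFree(dout);
}
```

Growing `threads` amortizes the global atomic over more work-items but raises contention on `local` and the cost of the two barriers, which is the tradeoff reported in finding 1).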
{"title":"A Study on Atomics-based Integer Sum Reduction in HIP on AMD GPU","authors":"Zheming Jin, J. Vetter","doi":"10.1145/3547276.3548627","DOIUrl":"https://doi.org/10.1145/3547276.3548627","url":null,"abstract":"Integer sum reduction is a primitive operation commonly used in scientific computing. Implementing a parallel reduction on a GPU often involves concurrent memory accesses using atomic operations and synchronization of work-items in a work-group. For a better understanding of these operations, we redesigned micro-kernels in the HIP programming language to measure the time of atomic operations over global memory, the cost of barrier synchronization, and reduction within a work-group to shared local memory using one atomic addition per work-item on a compute unit in an AMD MI100 GPU. Then, we describe the implementations of the reduction kernels with vectorized memory accesses, parameterized workload sizes, and vendor's library APIs. Our experimental results show that 1) there is a performance tradeoff between the cost of barrier synchronization and the amount of parallelism from atomic operations over shared local memory when we increase the size of a work-group. 2) a reduction kernel with vectorized memory accesses and vector data types is approximately 3% faster for the large problem size than the kernels written with the vendor's library APIs. 3) the compiler needs to assist the hardware processor with data dependency resolution at the level of instruction set architecture. 4) the power consumption of the kernel execution on the GPU fluctuates between 277 Watts and 301 Watts and the dynamic power of other GPU activities is at most 31 Watts.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133479207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parallel Beam Search for Combinatorial Optimization
Nikolaus Frohner, Jan Gmys, N. Melab, G. Raidl, E. Talbi
Inspired by the recent success of parallelized exact methods in solving difficult scheduling problems, we present a general parallel beam search framework for combinatorial optimization problems. Beam search is a constructive metaheuristic that traverses a search tree layer by layer while keeping a bounded number of promising nodes in each layer, so that many partial solutions are considered in parallel. We propose a variant suitable for intra-node parallelization by multithreading with data parallelism. Diversification and inter-node parallelization are combined by performing multiple randomized runs on independent workers communicating via MPI. For sufficiently large problem instances and beam widths, our prototypical implementation in the JIT-compiled Julia language achieves speed-ups of 30–42× on 46 cores with uniform memory access for two difficult classical problems, namely Permutation Flow Shop Scheduling (PFSP) with flowtime objective and the Traveling Tournament Problem (TTP). This allowed us to perform large-beam-width runs that found 11 new best feasible solutions for 22 difficult TTP benchmark instances of up to 20 teams, with an average wall-clock runtime of about one hour per instance.
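Stripped of parallelization, the core layer-by-layer scheme reads as in the generic C++ skeleton below; the `expand` and `cost` callbacks are placeholders for a concrete problem such as PFSP. The expansion loop over the current layer is the part the paper parallelizes with multithreading, while diversification comes from randomized copies of the whole search running over MPI.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Generic layer-by-layer beam search: expand every node of the current
// layer, then keep only the beamWidth most promising children.
template <typename Node>
std::vector<Node> beamSearch(
    std::vector<Node> layer, int depth, std::size_t beamWidth,
    const std::function<std::vector<Node>(const Node&)>& expand,
    const std::function<double(const Node&)>& cost) {
    for (int d = 0; d < depth; ++d) {
        std::vector<Node> next;
        for (const Node& n : layer) {   // data-parallel in the paper's variant
            std::vector<Node> children = expand(n);
            next.insert(next.end(), children.begin(), children.end());
        }
        if (next.size() > beamWidth) {  // prune to the cheapest beamWidth nodes
            auto cut = next.begin() + static_cast<std::ptrdiff_t>(beamWidth);
            std::partial_sort(next.begin(), cut, next.end(),
                              [&](const Node& x, const Node& y) {
                                  return cost(x) < cost(y);
                              });
            next.resize(beamWidth);
        }
        layer = std::move(next);
    }
    return layer;   // surviving (partial) solutions of the final layer
}
```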
{"title":"Parallel Beam Search for Combinatorial Optimization","authors":"Nikolaus Frohner, Jan Gmys, N. Melab, G. Raidl, E. Talbi","doi":"10.1145/3547276.3548633","DOIUrl":"https://doi.org/10.1145/3547276.3548633","url":null,"abstract":"Inspired by the recent success of parallelized exact methods to solve difficult scheduling problems, we present a general parallel beam search framework for combinatorial optimization problems. Beam search is a constructive metaheuristic traversing a search tree layer by layer while keeping in each layer a bounded number of promising nodes to consider many partial solutions in parallel. We propose a variant which is suitable for intra-node parallelization by multithreading with data parallelism. Diversification and inter-node parallelization are combined by performing multiple randomized runs on independent workers communicating via MPI. For sufficiently large problem instances and beam widths our prototypical implementation in the JIT-compiled Julia language admits speed-ups between 30–42 × on 46 cores with uniform memory access for two difficult classical problems, namely Permutation Flow Shop Scheduling (PFSP) with flowtime objective and the Traveling Tournament Problem (TTP). This allowed us to perform large beam width runs to find 11 new best feasible solutions for 22 difficult TTP benchmark instances up to 20 teams with an average wallclock runtime of about one hour per instance.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115070212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
The OpenMP Cluster Programming Model
H. Yviquel, M. Pereira, E. Francesquini, G. Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, S. Rigo, Alan Souza, G. Araújo
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system; to achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
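Since OMPC reuses OpenMP's offloading standard, a dependency chain of annotated regions looks like ordinary OpenMP 4.5+ code, as in the sketch below; under OMPC each target task would be scheduled on a cluster node rather than a local device, with the depend clauses driving its MPI-based data movement. The sketch uses only standard OpenMP directives and is an assumption about usage shape, not code from the paper.

```cpp
#include <cstdio>

const int N = 1 << 10;
float a[N], b[N];

int main() {
    #pragma omp parallel
    #pragma omp single
    {
        // First task produces a; second consumes it. The runtime (OMPC or a
        // plain OpenMP offloading runtime) orders them via the dependencies.
        #pragma omp target nowait map(from: a) depend(out: a)
        for (int i = 0; i < N; ++i) a[i] = (float)i;

        #pragma omp target nowait map(to: a) map(from: b) \
                depend(in: a) depend(out: b)
        for (int i = 0; i < N; ++i) b[i] = 2.0f * a[i];

        #pragma omp taskwait   // wait for both asynchronous target tasks
    }
    std::printf("b[42] = %.1f\n", b[42]);
    return 0;
}
```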
{"title":"The OpenMP Cluster Programming Model","authors":"H. Yviquel, M. Pereira, E. Francesquini, G. Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, S. Rigo, Alan Souza, G. Araújo","doi":"10.1145/3547276.3548444","DOIUrl":"https://doi.org/10.1145/3547276.3548444","url":null,"abstract":"Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP’s offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128838926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Accelerated Computation and Tracking of AC Optimal Power Flow Solutions Using GPUs
Youngdae Kim, Kibaek Kim
We present a scalable solution method, based on an alternating direction method of multipliers (ADMM) and graphics processing units (GPUs), for rapidly computing and tracking a solution of the alternating current optimal power flow (ACOPF) problem. Such fast computation is particularly useful for mitigating the negative impact of frequent load and generation fluctuations on the optimal operation of a large electrical grid. To this end, we decompose a given ACOPF problem by grid components, resulting in a large number of small, independent, nonlinear nonconvex optimization subproblems whose computation is significantly accelerated by the massive parallel computing capability of GPUs. In addition, the warm-start ability of our method leads to faster convergence, making it particularly suitable for fast tracking of optimal solutions. We demonstrate the performance of our method on a 70,000-bus system by solving the associated optimal power flow problems with both cold start and warm start.
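To show the mechanics that make ADMM attractive here, the toy C++ program below runs consensus ADMM on a separable quadratic: the per-term x-updates are independent (the part that maps onto GPU parallelism in the paper's setting), while the consensus and dual updates couple them. The problem and all values are illustrative only, not from the paper.

```cpp
#include <cstdio>
#include <vector>

// Consensus ADMM on:  minimize sum_i (a_i/2) * (x - b_i)^2
// Each term i keeps a local copy x_i; the consensus variable z plays the
// role of the shared quantities coupling ACOPF component subproblems.
int main() {
    std::vector<double> a = {1.0, 2.0, 4.0}, b = {3.0, -1.0, 2.0};
    const std::size_t m = a.size();
    std::vector<double> x(m, 0.0), u(m, 0.0);   // local copies, scaled duals
    double z = 0.0, rho = 1.0;                  // consensus var, penalty

    for (int it = 0; it < 50; ++it) {
        for (std::size_t i = 0; i < m; ++i)     // independent local solves
            x[i] = (a[i] * b[i] + rho * (z - u[i])) / (a[i] + rho);
        double s = 0.0;                         // consensus (averaging) step
        for (std::size_t i = 0; i < m; ++i) s += x[i] + u[i];
        z = s / m;
        for (std::size_t i = 0; i < m; ++i)     // dual ascent step
            u[i] += x[i] - z;
    }
    // Closed-form optimum: z* = sum(a_i*b_i)/sum(a_i) = (3 - 2 + 8)/7 = 9/7
    std::printf("z = %.6f (expected %.6f)\n", z, 9.0 / 7.0);
}
```

Warm-starting simply means seeding `x`, `u`, and `z` from the previous operating point instead of zero, which is why tracking a slowly drifting grid state converges in few iterations.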
{"title":"Accelerated Computation and Tracking of AC Optimal Power Flow Solutions Using GPUs","authors":"Youngdae Kim, Kibaek Kim","doi":"10.1145/3547276.3548631","DOIUrl":"https://doi.org/10.1145/3547276.3548631","url":null,"abstract":"We present a scalable solution method based on an alternating direction method of multipliers and graphics processing units (GPUs) for rapidly computing and tracking a solution of alternating current optimal power flow (ACOPF) problem. Such a fast computation is particularly useful for mitigating the negative impact of frequent load and generation fluctuations on the optimal operation of a large electrical grid. To this end, we decompose a given ACOPF problem by grid components, resulting in a large number of small independent nonlinear nonconvex optimization subproblems. The computation time of these subproblems is significantly accelerated by employing the massive parallel computing capability of GPUs. In addition, the warm-start ability of our method leads to faster convergence, making the method particularly suitable for fast tracking of optimal solutions. We demonstrate the performance of our method on a 70,000 bus system by solving associated optimal power flow problems with both cold start and warm start.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125552189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Workshop Proceedings of the 51st International Conference on Parallel Processing
{"title":"Workshop Proceedings of the 51st International Conference on Parallel Processing","authors":"","doi":"10.1145/3547276","DOIUrl":"https://doi.org/10.1145/3547276","url":null,"abstract":"","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131434284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0