
Latest Publications: Workshop Proceedings of the 51st International Conference on Parallel Processing

The Support of MLIR HLS Adaptor for LLVM IR
Geng-Ming Liang, Chuan-Yue Yuan, Meng-Shiun Yuan, Tai-Liang Chen, Kuan-Hsun Chen, Jenq-Kuen Lee
Since the emergence of MLIR, High-Level Synthesis (HLS) tools have started to adopt multi-level abstractions in their design. Unlike traditional HLS tools based on a single abstraction (e.g., LLVM), optimizations at different levels of abstraction can benefit from cross-layer optimization for better results. Although current MLIR-based HLS tools can generate HLS C/C++ for synthesis, we believe that a direct IR transformation from MLIR to LLVM preserves more expression details. In this paper, we propose an adaptor for LLVM IR that optimizes the IR generated from MLIR into HLS-readable IR. Without the gap of unsupported syntax between different versions, developers can focus on their specialization. Our preliminary results show that the MLIR flow via our adaptor achieves performance comparable to flows in which MLIR HLS tools generate HLS C++ code. The experiments are performed with the Xilinx Vitis and HLS tools.
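The adaptor idea above can be pictured as a small IR-rewriting pass. The sketch below is purely illustrative and assumes a made-up list of function attributes that a legacy HLS LLVM parser might reject; it does not reproduce the paper's actual transformations.

```python
import re

# Hypothetical adaptor pass: strip attributes that MLIR's LLVM lowering emits
# but that an older HLS-side LLVM parser might not accept. The attribute names
# listed here are assumptions chosen only to illustrate the rewriting step.
UNSUPPORTED_ATTRS = {"mustprogress", "willreturn", "nofree"}

def adapt_ir(ir_text: str) -> str:
    """Return IR text with the assumed-unsupported attributes removed."""
    for attr in UNSUPPORTED_ATTRS:
        ir_text = re.sub(rf"\b{attr}\b\s?", "", ir_text)
    return ir_text
```

A real adaptor would work on the in-memory LLVM module rather than on text; the textual form is used here only to keep the sketch self-contained.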
DOI: 10.1145/3547276.3548515 (published 2022-08-29)
Citations: 1
Structured Concurrency: A Review
Yi-An Chen, Yi-Ping You
Today, mobile applications use thousands of concurrent tasks to process multiple sensor inputs to ensure a better user experience. With this demand, the ability to manage these concurrent tasks efficiently and easily is becoming a new challenge, especially with respect to task lifetimes. Structured concurrency is a technique that reduces the complexity of managing a large number of concurrent tasks. Several languages and libraries (e.g., Kotlin, Swift, and Trio) support this paradigm for better concurrency management. It is worth noting that structured concurrency has been consistently implemented on top of coroutines across all these languages and libraries. However, there are no documents or studies in the literature that indicate why and how coroutines are relevant to structured concurrency. In contrast, the mainstream community views structured concurrency as a successor to structured programming; that is, the concept of "structure" extends from ordinary programming to concurrent programming. Nevertheless, such a viewpoint does not explain why structured concurrency emerged more than 40 years after structured programming was introduced in the early 1970s, even though concurrent programming started in the 1960s. In this paper, we introduce a new theory that complements the origin of structured concurrency from historical and technical perspectives: it is the foundation established by coroutines that gives birth to structured concurrency.
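The coroutine-based view of structured concurrency described above can be sketched with standard Python coroutines: a scope awaits all of its children before returning, so no child outlives its parent. This is a minimal illustration and is not tied to Kotlin, Swift, or Trio.

```python
import asyncio

# Minimal structured-concurrency sketch: every child coroutine is bound to the
# lifetime of the scope that spawned it, so the scope cannot exit early.
async def child(name: str, delay: float, log: list):
    await asyncio.sleep(delay)
    log.append(name)

async def scope(log: list):
    # gather() suspends this coroutine until BOTH children finish; that
    # "children cannot outlive the parent" guarantee is the core structure.
    await asyncio.gather(child("fast", 0.01, log), child("slow", 0.02, log))
    log.append("scope-exit")

log: list = []
asyncio.run(scope(log))
```

If a child raised an exception, `gather` would propagate it into the scope, which is how structured implementations keep errors from leaking out of the task tree.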
DOI: 10.1145/3547276.3548519 (published 2022-08-29)
Citations: 0
Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps
Eishi Arima, Minjoon Kang, Issa Saba, J. Weidendorfer, C. Trinitis, Martin Schulz
CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy efficiency of such systems is still one of the most critical issues. As one single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, as power consumption has become the first-class design constraint for HPC systems, such co-scheduling techniques should be well tailored for power-constrained environments. To this end, the industry recently started supporting hardware-level resource partitioning features on modern GPUs for realizing efficient co-scheduling, which can operate with existing power capping features. For example, NVIDIA's MIG (Multi-Instance GPU) partitions one single GPU into multiple instances at the granularity of a GPC (Graphics Processing Cluster). In this paper, we explicitly target the combination of hardware-level GPU partitioning features and power capping for power-constrained HPC systems. We provide a systematic methodology to optimize the combination of chip partitioning, job allocations, and power capping based on our scalability/interference modeling while taking a variety of aspects into account, such as compute/memory intensity and utilization of heterogeneous computational resources (e.g., Tensor Cores). Our experimental results indicate that our approach successfully selects a near-optimal combination across multiple different workloads.
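As a rough illustration of the optimization the paper describes, one can enumerate (partition, power cap) combinations and keep the one with the best modeled throughput under a node-level power budget. The partition sizes, caps, and throughput model below are invented placeholders, not the authors' scalability/interference model.

```python
from itertools import product

# MIG-style GPC splits for two co-scheduled jobs (A, B) and candidate caps.
# All numbers here are illustrative assumptions.
PARTITIONS = [(7, 0), (4, 3), (3, 4)]   # GPCs given to job A / job B
POWER_CAPS = [150, 200, 250]            # per-GPU cap in watts
NODE_POWER_BUDGET = 250

def throughput(gpcs: int, watts: int) -> float:
    # Placeholder model: sublinear scaling in compute units, scaled by power.
    return (gpcs ** 0.8) * (watts / 250)

def best_config():
    # Exhaustive search over feasible (partition, cap) combinations.
    return max(
        ((a, b, cap) for (a, b), cap in product(PARTITIONS, POWER_CAPS)
         if cap <= NODE_POWER_BUDGET),
        key=lambda c: throughput(c[0], c[2]) + throughput(c[1], c[2]),
    )
```

With the sublinear model above, splitting the GPU between two jobs beats giving all GPCs to one job, which is the intuition behind co-scheduling complementary workloads.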
DOI: 10.1145/3547276.3548630 (published 2022-08-29)
Citations: 3
Designing Hierarchical Multi-HCA Aware Allgather in MPI
Tu Tran, Benjamin Michalowicz, B. Ramesh, H. Subramoni, A. Shafi, D. Panda
To accelerate the communication between nodes, supercomputers are now equipped with multiple network adapters per node, resulting in a "multi-rail" network. The second- and third-placed systems of the Top500 use two adapters per node; recently, the ThetaGPU system at Argonne National Laboratory (ANL) uses eight adapters per node. With such an availability of networking resources, it is a non-trivial task to utilize all of them. The Message Passing Interface (MPI) is a dominant model for high-performance computing clusters. Not all MPI collectives utilize all resources, and this becomes more apparent with advances in bandwidth and adapter count in a given cluster. In this work, we take up this task and propose hierarchical, multi-HCA aware Allgather designs; Allgather is a communication-intensive collective widely used in applications like matrix multiplication and in other collectives. The proposed designs fully utilize all the available network adapters within a node and provide high overlap between inter-node and intra-node communication. At the micro-benchmark level, our new schemes achieve performance improvements for both single-node and multi-node communication. We see inter-node improvements of up to 62% and 61% over HPC-X and MVAPICH2-X, respectively, for 1024 processes. The design for inter-node communication also boosts the performance of Ring Allreduce by 56% and 44% compared to HPC-X and MVAPICH2-X. At the application level, the enhanced Allgather shows 1.98x and 1.42x improvement in a matrix-vector multiplication kernel when compared to HPC-X and MVAPICH2-X, and Allreduce performs up to 7.83% better in deep learning training against MVAPICH2-X.
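The hierarchical, multi-rail idea can be sketched as pure data movement: gather inside each node, then exchange node buffers between nodes with the traffic striped across the node's adapters. The chunk-to-HCA mapping below is an assumption for illustration; the real designs operate inside MPI on InfiniBand HCAs.

```python
# Toy data-movement model of a hierarchical, multi-HCA allgather (no real MPI):
# step 1 gathers within each node, step 2 exchanges node buffers across nodes
# with each buffer striped over the node's adapters, and every process then
# shares the assembled result.
def hierarchical_allgather(nodes, num_hcas=2):
    # nodes: list of nodes, each node a list of per-process values
    node_bufs = [list(node) for node in nodes]          # 1) intra-node gather
    result = [v for buf in node_bufs for v in buf]      # 2) inter-node exchange
    # Striping (assumed policy): element j of a node buffer travels over
    # adapter j % num_hcas, so all adapters carry roughly equal traffic.
    per_hca = {h: [] for h in range(num_hcas)}
    for buf in node_bufs:
        for j, v in enumerate(buf):
            per_hca[j % num_hcas].append(v)
    return result, per_hca
```

The point of the striping map is load balance: no single HCA becomes the bottleneck for a node's outbound buffer, which is where the reported bandwidth gains come from.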
DOI: 10.1145/3547276.3548524 (published 2022-08-29)
Citations: 1
Execution Flow Aware Profiling for ROS-based Autonomous Vehicle Software
Shao-Hua Wang, Chia-Heng Tu, C. Huang, J. Juang
The complexity of Robot Operating System (ROS)-based autonomous software grows as autonomous vehicles become more intelligent. Rapidly understanding the runtime behavior and performance of such sophisticated software is a major challenge for system designers, because conventional tools are insufficient for characterizing the high-level interactions of the modules within the software. In this paper, a new graphical representation, the execution flow graph, is devised to represent the execution sequences and related performance statistics of ROS modules. Execution flow aware profiling is applied to the autonomous software Autoware and the Navigation Stack, with encouraging results.
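An execution flow graph of this kind can be approximated from a callback trace: nodes are ROS callbacks, and edges are observed handoffs annotated with counts and mean latency. The event-tuple format below is assumed for illustration, not the paper's actual trace format.

```python
from collections import defaultdict

# Hedged sketch: build an edge-annotated flow graph from hypothetical trace
# events of the form (source_callback, destination_callback, latency_ms).
def build_flow_graph(events):
    edges = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for src, dst, ms in events:
        e = edges[(src, dst)]
        e["count"] += 1
        e["total_ms"] += ms
    # Summarize each edge with a count and mean handoff latency.
    return {k: {"count": v["count"], "mean_ms": v["total_ms"] / v["count"]}
            for k, v in edges.items()}
```

Reading the resulting edge statistics along a chain such as sensing, detection, and planning is what lets a designer spot the slow handoff without digging into per-function profiles.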
DOI: 10.1145/3547276.3548516 (published 2022-08-29)
Citations: 0
A Software/Hardware Co-design Local Irregular Sparsity Method for Accelerating CNNs on FPGA
Jiangwei Shang, Zhan Zhang, Chuanyou Li, Kun Zhang, Lei Qian, Hongwei Liu
Convolutional neural networks (CNNs) have been widely used in different areas. The success of CNNs comes with a huge number of parameters and computations, and nowadays CNNs keep moving toward larger structures. Although larger structures often bring better inference accuracy, the increasing size also slows down inference. Recently, various parameter sparsity methods have been proposed to accelerate CNNs by reducing the number of parameters and computations. Existing sparsity methods can be classified into two categories: unstructured and structured. Unstructured sparsity methods easily cause irregularity and thus achieve suboptimal speedup. On the other hand, structured sparsity methods keep regularity by pruning parameters following a certain pattern, but result in low sparsity. In this paper, we propose a software/hardware co-design approach that brings local irregular sparsity into CNNs. Benefiting from the local irregularity, we design a row-wise computing engine, the RConv Engine, to achieve workload balance and remarkable speedup. The experimental results show that our software/hardware co-design method achieves a 10.9x speedup over state-of-the-art methods with negligible accuracy loss.
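The notion of local irregular sparsity can be sketched as row-wise magnitude pruning: within each weight row the surviving positions are irregular, but every row keeps the same number of survivors, so a row-wise engine sees a balanced workload. The per-row budget k is an assumed knob here, not necessarily the paper's exact scheme.

```python
# Sketch of "local irregular sparsity": prune freely by magnitude inside each
# row (irregular locally), but keep exactly k survivors per row (regular
# across rows), so per-row compute stays balanced for a row-wise engine.
def prune_rowwise(weights, k):
    pruned = []
    for row in weights:
        # Indices of the k largest-magnitude entries in this row.
        order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        keep = set(order[:k])
        pruned.append([v if i in keep else 0.0 for i, v in enumerate(row)])
    return pruned
```

Compared with globally unstructured pruning, every row here carries the same number of multiply-accumulates, which is what makes a fixed-width row-wise hardware pipeline efficient.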
DOI: 10.1145/3547276.3548521 (published 2022-08-29)
Citations: 0
Two-Stage Pre-processing for License Recognition
J. Zhang, Cheng-Tsung Chan, Minmin Sun
Various financial insurance and investment application websites require customers to upload identity documents, such as vehicle licenses, to verify their identities. Manual verification of these documents is costly; hence, there is a clear demand for automatic document recognition. This study proposes a two-stage method to pre-process a vehicle license for better text recognition. In the first stage, the distortion that often appears in photographed documents is repaired. In the second stage, each data field is carefully located. The captured fields are then processed by commercial text recognition software. Due to the sensitivity of vehicle licenses, it is difficult to collect enough data for model training. Consequently, artificial vehicle licenses are synthesized for model training to mitigate overfitting. In addition, an encoder is applied before text recognition to reduce background noise, remove borders crossing over text, and sharpen blurred text. On a real dataset, the proposed method achieves accuracy close to 90%.
DOI: 10.1145/3547276.3548441 (published 2022-08-29)
Citations: 0
A framework for low communication approaches for large scale 3D convolution
Anuva Kulkarni, Jelena Kovacevic, F. Franchetti
Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to/from memory have not increased proportionally to computational capacity (in terms of FLOPs), 3D FFTs become communication-bound and are difficult to scale, especially on modern heterogeneous computing platforms consisting of accelerators like GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution that avoids multiple all-to-all exchanges and optimizes the remaining communication. We provide proof-of-concept results under the assumptions of a use case, the MASSIF Hooke's law simulation convolution kernel. Our method localizes computation by exploiting properties of the data and approximates the convolution result via data compression, resulting in increased scalability of 3D convolution. Our preliminary results show 8x better scalability than traditional methods on the same compute resources without adversely affecting result accuracy. Our method can be adapted for first-principle scientific simulations and leverages cross-disciplinary knowledge of the application, the data, and computing to perform large-scale convolution while avoiding communication bottlenecks. To make our approach widely usable and adaptable to emerging challenges, we discuss the use of FFTX, a novel framework that can be used for platform-agnostic specification and optimization of algorithmic approaches similar to ours.
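The locality argument can be shown in a toy 1D direct convolution: each block only needs a halo of kernel radius r from its immediate neighbors, so no all-to-all exchange is required. The paper targets 3D FFT-based convolution with compression; this direct 1D form is only an illustration of why communication can stay local.

```python
# Direct (correlation-style) stencil over the whole signal, zero-padded edges.
def conv_full(signal, kernel):
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, kv in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(signal):
                acc += signal[idx] * kv
        out.append(acc)
    return out

# Same result computed block by block: each "rank" works on its block plus a
# halo of radius r fetched from neighbors only (no global data exchange).
def conv_blocked(signal, kernel, num_blocks):
    r = len(kernel) // 2
    n = len(signal)
    step = n // num_blocks
    out = []
    for b in range(num_blocks):
        lo = b * step
        hi = n if b == num_blocks - 1 else (b + 1) * step
        halo_lo, halo_hi = max(0, lo - r), min(n, hi + r)   # neighbor halo only
        local = conv_full(signal[halo_lo:halo_hi], kernel)
        out.extend(local[lo - halo_lo: lo - halo_lo + (hi - lo)])
    return out
```

Because the stencil has compact support, the blocked result matches the global one exactly; in the FFT setting the equivalent trick requires the approximation and compression steps the abstract describes.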
DOI: 10.1145/3547276.3548626 (published 2022-08-29)
Cited by: 0
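The FFT-based 3D convolution this entry starts from rests on the convolution theorem: transform both volumes, multiply pointwise, transform back. A minimal pure-Python sketch of that step (direct O(n²) DFTs stand in for the optimized parallel FFTs; the grid size and all names are illustrative, not from the paper):

```python
import cmath
import random

def dft(x, inverse=False):
    """Direct O(n^2) DFT, a stand-in for an optimized FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def dft3(a, inverse=False):
    """3D transform as three sweeps of 1D transforms (separability)."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    a = [[dft(a[i][j], inverse) for j in range(ny)] for i in range(nx)]  # z sweep
    for i in range(nx):                                                  # y sweep
        for k in range(nz):
            col = dft([a[i][j][k] for j in range(ny)], inverse)
            for j in range(ny):
                a[i][j][k] = col[j]
    for j in range(ny):                                                  # x sweep
        for k in range(nz):
            col = dft([a[i][j][k] for i in range(nx)], inverse)
            for i in range(nx):
                a[i][j][k] = col[i]
    return a

def conv3_fft(a, b):
    """Circular 3D convolution via the convolution theorem."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    A, B = dft3(a), dft3(b)
    C = [[[A[i][j][k] * B[i][j][k] for k in range(nz)]
          for j in range(ny)] for i in range(nx)]
    return [[[v.real for v in row] for row in plane]
            for plane in dft3(C, inverse=True)]

def conv3_direct(a, b):
    """Reference direct circular convolution for checking."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    return [[[sum(a[p][q][r] * b[(i - p) % nx][(j - q) % ny][(k - r) % nz]
                  for p in range(nx) for q in range(ny) for r in range(nz))
              for k in range(nz)] for j in range(ny)] for i in range(nx)]

# quick check on a small 2x3x4 grid
random.seed(0)
a = [[[random.random() for _ in range(4)] for _ in range(3)] for _ in range(2)]
b = [[[random.random() for _ in range(4)] for _ in range(3)] for _ in range(2)]
err = max(abs(x - y)
          for pf, pd in zip(conv3_fft(a, b), conv3_direct(a, b))
          for rf, rd in zip(pf, pd)
          for x, y in zip(rf, rd))
```

On a parallel machine each of the three sweeps touches a different axis of the distributed volume, which is what forces the all-to-all redistributions the paper's localized, compression-based approach avoids.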
A Pipeline Pattern Detection Technique in Polly
Delaram Talaashrafi, J. Doerfert, M. M. Maza
The polyhedral model has repeatedly shown how it facilitates various loop transformations, including loop parallelization, loop tiling, and software pipelining. However, parallelism is almost exclusively exploited on a per-loop basis without much work on detecting cross-loop parallelization opportunities. While many problems can be scheduled such that loop dimensions are dependence-free, the resulting loop parallelism does not necessarily maximize concurrent execution, especially not for unbalanced problems. In this work, we introduce a polyhedral-model-based analysis and scheduling algorithm that exposes and utilizes cross-loop parallelization through tasking. This work exploits pipeline patterns between iterations in different loop nests, and it is well suited to handle imbalanced iterations. Our LLVM/Polly-based prototype performs schedule modifications and code generation targeting a minimal, language agnostic tasking layer. We present results using an implementation of this API with the OpenMP task construct. For different computation patterns, we achieved speed-ups of up to 3.5 × on a quad-core processor while LLVM/Polly alone fails to exploit the parallelism.
{"title":"A Pipeline Pattern Detection Technique in Polly","authors":"Delaram Talaashrafi, J. Doerfert, M. M. Maza","doi":"10.1145/3547276.3548445","DOIUrl":"https://doi.org/10.1145/3547276.3548445","url":null,"abstract":"The polyhedral model has repeatedly shown how it facilitates various loop transformations, including loop parallelization, loop tiling, and software pipelining. However, parallelism is almost exclusively exploited on a per-loop basis without much work on detecting cross-loop parallelization opportunities. While many problems can be scheduled such that loop dimensions are dependence-free, the resulting loop parallelism does not necessarily maximize concurrent execution, especially not for unbalanced problems. In this work, we introduce a polyhedral-model-based analysis and scheduling algorithm that exposes and utilizes cross-loop parallelization through tasking. This work exploits pipeline patterns between iterations in different loop nests, and it is well suited to handle imbalanced iterations. Our LLVM/Polly-based prototype performs schedule modifications and code generation targeting a minimal, language agnostic tasking layer. We present results using an implementation of this API with the OpenMP task construct. 
For different computation patterns, we achieved speed-ups of up to 3.5 × on a quad-core processor while LLVM/Polly alone fails to exploit the parallelism.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131955309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1
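The cross-loop pipeline pattern the paper detects can be mimicked with a two-stage producer/consumer: iteration i of the second loop nest runs as soon as iteration i of the first has finished, rather than after the entire first loop. A minimal Python analogue (this is not Polly's scheduling machinery or its tasking layer; `stage1` and `stage2` are hypothetical per-iteration loop bodies):

```python
import queue
import threading

def pipelined(items, stage1, stage2):
    """Two dependent loops run as a two-stage pipeline: stage2 on
    iteration i overlaps with stage1 on iteration i+1, instead of
    waiting for the whole first loop to finish."""
    done = object()                      # sentinel marking end of stream
    q = queue.Queue()

    def first_loop():                    # plays the role of the first loop nest
        for x in items:
            q.put(stage1(x))
        q.put(done)

    t = threading.Thread(target=first_loop)
    t.start()
    results = []
    while True:                          # the second loop nest, consuming eagerly
        y = q.get()
        if y is done:
            break
        results.append(stage2(y))
    t.join()
    return results

out = pipelined(range(5), lambda x: x * x, lambda y: y + 1)
```

Iteration order is preserved because a single producer feeds a FIFO queue, so `out` is [1, 2, 5, 10, 17]; the overlap between stages is what helps with the imbalanced iterations the abstract mentions.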
Frequency Recovery in Power Grids using High-Performance Computing
Vishwas Rao, A. Subramanyam, Michel Schanen, Youngdae Kim, Ignas Šatkauskas, M. Anitescu
Maintaining electric power system stability is paramount, especially in extreme contingencies involving unexpected outages of multiple generators or transmission lines that are typical during severe weather events. Such outages often lead to large supply-demand mismatches followed by subsequent system frequency deviations from their nominal value. The extent of frequency deviations is an important metric of system resilience, and its timely mitigation is a central goal of power system operation and control. This paper develops a novel nonlinear model predictive control (NMPC) method to minimize frequency deviations when the grid is affected by an unforeseen loss of multiple components. Our method is based on a novel multi-period alternating current optimal power flow (ACOPF) formulation that accurately models both nonlinear electric power flow physics and the primary and secondary frequency response of generator control mechanisms. We develop a distributed parallel Julia package for solving the large-scale nonlinear optimization problems that result from our NMPC method and thereby address realistic test instances on existing high-performance computing architectures. Our method demonstrates superior performance in terms of frequency recovery over existing industry practices, where generator levels are set based on the solution of single-period classical ACOPF models.
{"title":"Frequency Recovery in Power Grids using High-Performance Computing","authors":"Vishwas Rao, A. Subramanyam, Michel Schanen, Youngdae Kim, Ignas Šatkauskas, M. Anitescu","doi":"10.1145/3547276.3548632","DOIUrl":"https://doi.org/10.1145/3547276.3548632","url":null,"abstract":"Maintaining electric power system stability is paramount, especially in extreme contingencies involving unexpected outages of multiple generators or transmission lines that are typical during severe weather events. Such outages often lead to large supply-demand mismatches followed by subsequent system frequency deviations from their nominal value. The extent of frequency deviations is an important metric of system resilience, and its timely mitigation is a central goal of power system operation and control. This paper develops a novel nonlinear model predictive control (NMPC) method to minimize frequency deviations when the grid is affected by an unforeseen loss of multiple components. Our method is based on a novel multi-period alternating current optimal power flow (ACOPF) formulation that accurately models both nonlinear electric power flow physics and the primary and secondary frequency response of generator control mechanisms. We develop a distributed parallel Julia package for solving the large-scale nonlinear optimization problems that result from our NMPC method and thereby address realistic test instances on existing high-performance computing architectures. 
Our method demonstrates superior performance in terms of frequency recovery over existing industry practices, where generator levels are set based on the solution of single-period classical ACOPF models.","PeriodicalId":255540,"journal":{"name":"Workshop Proceedings of the 51st International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127699588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
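The primary and secondary frequency response that the paper's multi-period ACOPF formulation models can be illustrated with an aggregate one-machine swing equation under droop (primary) and integral/AGC (secondary) control. A forward-Euler sketch under illustrative per-unit parameters (none of the values are taken from the paper):

```python
def simulate_frequency(t_end=30.0, dt=0.01, M=10.0, D=1.0,
                       R=0.05, Ki=2.0, dP_loss=0.2):
    """Per-unit frequency deviation after a sudden generation loss,
    with droop (primary) and integral/AGC (secondary) response.
    Forward-Euler integration of an aggregate swing equation."""
    f = 0.0        # frequency deviation from nominal (p.u.)
    agc = 0.0      # secondary-control set-point (p.u.)
    trace = []
    for _ in range(int(t_end / dt)):
        p_mech = -f / R + agc                      # primary droop + AGC
        f += dt * (p_mech - dP_loss - D * f) / M   # swing equation
        agc += dt * (-Ki * f)                      # integral action restores f
        trace.append(f)
    return trace

trace = simulate_frequency()
nadir = min(trace)       # worst transient under-frequency
recovered = trace[-1]    # near zero once secondary response has acted
```

After the 0.2 p.u. generation loss, the deviation dips toward the droop-limited quasi-steady value −ΔP/(1/R + D) and is then driven back toward zero by the integral term; the paper's NMPC replaces this fixed feedback with an optimized multi-period schedule of generator set-points.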
Journal: Workshop Proceedings of the 51st International Conference on Parallel Processing