
Proceedings of the 37th International Conference on Supercomputing: Latest Publications

Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593713
Ismayil Ismayilov, Javid Baydamirli, Doğan Sağbili, M. Wahib, D. Unat
This paper proposes a fully autonomous execution model for multi-GPU applications that completely excludes the involvement of the CPU beyond the initial kernel launch. In a typical multi-GPU application, the host serves as the orchestrator of execution by directly launching kernels, issuing communication calls, and acting as a synchronizer for devices. We argue that this orchestration, or control flow path, causes undue overhead and can be delegated entirely to devices to improve performance in applications that require communication among peers. For the proposed CPU-free execution model, we leverage existing techniques such as persistent kernels, thread block specialization, device-side barriers, and device-initiated communication routines to write fully autonomous multi-GPU code and achieve significantly reduced communication overheads. We demonstrate our proposed model on two broadly used iterative solvers, the 2D/3D Jacobi stencil and Conjugate Gradient (CG). Compared to the CPU-controlled baselines, the CPU-free model can improve 3D stencil communication latency by 58.8% and provide a 1.63x speedup for CG on 8 NVIDIA A100 GPUs. The project code is available at https://github.com/ParCoreLab/CPU-Free-model.
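As a concrete reference point, the 2D Jacobi stencil used as one of the benchmarks can be sketched in a few lines of NumPy; this is only the host-driven numerical kernel (grid size and iteration count are illustrative), with none of the paper's multi-GPU or CPU-free machinery:

```python
import numpy as np

def jacobi_2d(grid, iters):
    """Plain host-driven Jacobi: each interior point becomes the average
    of its four neighbors; boundary values are held fixed."""
    for _ in range(iters):
        new = grid.copy()
        new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])
        grid = new
    return grid

# Toy run: a hot left boundary diffuses into a cold 8x8 grid.
g = np.zeros((8, 8))
g[:, 0] = 1.0
out = jacobi_2d(g, 100)
```

In the CPU-controlled baseline, each sweep of a distributed version of this loop would also involve host-issued halo exchanges and synchronization; the CPU-free model keeps the whole loop resident on the devices.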
Citations: 1
Parallel Software for Million-scale Exact Kernel Regression
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593737
Yu Chen, Lucca Skon, James R. McCombs, Zhenming Liu, A. Stathopoulos
We present the design and implementation of a kernel principal component regression software that handles training datasets with a million or more observations. Kernel regressions are nonlinear and interpretable models that have wide downstream applications, and are shown to have a close connection to deep learning. Nevertheless, the exact regression of large-scale kernel models using currently available software has been notoriously difficult because it is both compute and memory intensive and it requires extensive tuning of hyperparameters. While in computational science distributed computing and iterative methods have been a mainstay of large-scale software, they have not been widely adopted in kernel learning. Our software leverages existing high performance computing (HPC) techniques and develops new ones that address cross-cutting constraints between HPC and learning algorithms. It integrates three major components: (a) a state-of-the-art parallel eigenvalue iterative solver, (b) a block matrix-vector multiplication routine that employs both multi-threading and distributed memory parallelism and can be performed on-the-fly under limited memory, and (c) a software pipeline consisting of Python front-ends that control the HPC backbone and the hyperparameter optimization through a boosting optimizer. We perform feasibility studies by running the entire ImageNet dataset and a large asset pricing dataset.
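At toy scale, exact kernel regression can be written directly; below is a minimal RBF kernel ridge sketch in NumPy. The paper's eigensolver-based kernel PCA step, distributed matrix-vector products, and boosting-based hyperparameter tuning are all omitted, and the bandwidth and regularization values are illustrative:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fit(X, y, gamma=1.0, lam=1e-3):
    """Dense exact solve of (K + lam*I) alpha = y.  This O(n^3) step is
    exactly what stops scaling at large n and what the paper replaces
    with iterative, distributed machinery."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0])
alpha = fit(X, y)
resid = np.abs(predict(X, alpha, X) - y)
```

Already at a million observations the kernel matrix alone would need terabytes if materialized, which motivates the on-the-fly block matrix-vector products described above.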
Citations: 0
Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593714
Lukas Trümper, Tal Ben-Nun, Philipp Schaad, A. Calotoiu, T. Hoefler
Performance optimization is an increasingly challenging but often repetitive task. While each platform has its quirks, the underlying code transformations rely on data movement and computational characteristics that recur across applications. This paper proposes to leverage those similarities by constructing an embedding space for subprograms. The continuous space captures both static and dynamic properties of loop nests via symbolic code analysis and performance profiling, respectively. Performance embeddings enable direct knowledge transfer of performance tuning between applications, which can result from autotuning or tailored improvements. We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils. Transfer tuning reduces the search complexity by up to four orders of magnitude and outperforms the MKL library in sparse-dense matrix multiplication. The results exhibit clear correspondences between program characteristics and optimizations, outperforming prior specialized state-of-the-art approaches and generalizing beyond their capabilities.
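The transfer-tuning idea can be illustrated with a toy nearest-neighbor lookup in an embedding space; the embedding vectors and transformation names below are invented placeholders, not the paper's actual representation:

```python
import numpy as np

# Hypothetical database mapping the embedding of an already-tuned
# subprogram to the transformation that worked for it.
db_embeddings = np.array([
    [0.9, 0.1, 0.0],   # e.g. memory-bound stencil
    [0.1, 0.8, 0.1],   # e.g. dense matmul-like loop nest
    [0.0, 0.2, 0.9],   # e.g. sparse gather/scatter kernel
])
db_transforms = ["tile+vectorize", "register-block", "reorder-gather"]

def transfer_tune(embedding):
    """Return the transformation of the most similar known subprogram,
    using cosine similarity in the embedding space."""
    e = embedding / np.linalg.norm(embedding)
    d = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
    return db_transforms[int(np.argmax(d @ e))]

choice = transfer_tune(np.array([0.85, 0.2, 0.05]))
```

The search-space reduction reported above comes from starting the tuner at `choice` rather than exploring transformations from scratch.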
Citations: 1
GRAP: Group-level Resource Allocation Policy for Reconfigurable Dragonfly Network in HPC
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593732
Guangnan Feng, Dezun Dong, Shizhen Zhao, Yutong Lu
Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has been adopted in new exascale High Performance Computing (HPC) systems. However, the Dragonfly topology suffers from limited direct links between groups. A reconfigurable network can solve this problem by reconfiguring the topology to adjust the number of direct links between groups. While the performance improvement of a single job on a reconfigurable HPC network has been evaluated in previous works, the performance of HPC workloads has not been studied because of the lack of an appropriate resource allocation policy. In this work, we propose the Group-level Resource Allocation Policy (GRAP) to allocate both compute nodes and Reconfigurable Links for jobs in a Reconfigurable Dragonfly Network (RDN). We start by formulating three design principles: a reconfigurable network should be free of reconfiguration interference, guarantee connectivity and performance for each job, and satisfy varied resource requests. Following these principles, GRAP uses different strategies for small and large jobs, and provides three allocation modes for large jobs: Balance Mode, Custom Mode, and Adaptive Mode. Finally, we evaluate GRAP with the CODES network simulation framework and the Slurm Simulator using real workload traces. The results demonstrate that RDN coupled with GRAP achieves lower latency, higher bandwidth, and lower job wait time.
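The small/large-job split and the three large-job modes suggest a dispatch structure along these lines; the threshold, mode semantics, and strategy labels here are illustrative placeholders, not GRAP's actual allocation logic:

```python
def select_allocation(job_nodes, group_size, mode="adaptive"):
    """Sketch of a group-level allocation decision: jobs that fit inside
    one Dragonfly group need no inter-group reconfigurable links; larger
    jobs are handled under one of three modes (Balance/Custom/Adaptive).
    All labels below are illustrative."""
    if job_nodes <= group_size:
        return "single-group"          # no reconfigurable links needed
    if mode == "balance":
        return "spread-evenly"         # equal share of inter-group links
    if mode == "custom":
        return "user-specified-links"  # job requests its own link count
    return "adaptive-links"            # links follow observed demand

plan = select_allocation(job_nodes=512, group_size=128, mode="balance")
```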
Citations: 0
FLORIA: A Fast and Featherlight Approach for Predicting Cache Performance
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593740
Jun Xiao, Yaocheng Xiang, Xiaolin Wang, Yingwei Luo, A. Pimentel, Zhenlin Wang
The cache Miss Ratio Curve (MRC) serves a variety of purposes such as cache partitioning, application profiling, and code tuning. In this work, we propose a new metric, called cache miss distribution, that describes cache miss behavior over cache sets, for predicting cache MRCs. Based on this metric, we present FLORIA, a software-based, online approach that approximates cache MRCs on commodity systems. By polluting a tunable number of cache lines in some selected cache sets using our designed microbenchmark, the cache miss distribution for the target workload is obtained via hardware performance counters with the support of precise event-based sampling (PEBS). A model is developed to predict the MRC of the target workload based on its cache miss distribution. We evaluate FLORIA for systems consisting of a single application as well as a wide range of different workload mixes. Compared with the state-of-the-art approaches in predicting online MRCs, FLORIA achieves the highest average accuracy of 97.29% with negligible overhead. It also allows fast and accurate estimation of the online MRC within 5 ms, 20X faster than the state-of-the-art approaches. We also demonstrate that FLORIA can be applied to guiding cache partitioning for multiprogrammed workloads, helping to improve overall system performance.
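For contrast with FLORIA's online approximation, the exact LRU miss ratio curve can be computed offline with the classic stack-distance method; a small sketch, assuming a fully associative LRU cache (FLORIA instead infers the curve from per-set miss distributions measured with performance counters):

```python
def miss_ratio_curve(trace, max_size):
    """Exact LRU MRC via stack (reuse) distances: a reference with
    distance d hits in any LRU cache holding more than d lines."""
    stack, dist_hist, = [], [0] * max_size
    for addr in trace:
        if addr in stack:
            d = stack.index(addr)          # reuse distance
            if d < max_size:
                dist_hist[d] += 1
            stack.remove(addr)
        stack.insert(0, addr)              # most recently used on top
    n = len(trace)
    hits, mrc = 0, []
    for c in range(max_size):              # cache size c + 1 lines
        hits += dist_hist[c]
        mrc.append((n - hits) / n)
    return mrc

trace = ["a", "b", "c", "a", "b", "c", "d", "a"]
mrc = miss_ratio_curve(trace, 4)
```

This exact method needs the full address trace, which is why counter-based approximations like FLORIA are attractive online.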
Citations: 0
Roar: A Router Microarchitecture for In-network Allreduce
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593711
Ruiqi Wang, Dezun Dong, Fei Lei, Junchao Ma, Ketong Wu, KaiCheng Lu
The allreduce operation is the most commonly used collective operation in distributed or parallel applications. It aggregates data collected from distributed hosts and broadcasts the aggregated result back to them. In-network computing can accelerate allreduce by offloading this operation into network devices. However, existing in-network solutions face challenges in sustaining high throughput, aggregating large messages efficiently, and producing reproducible results. In this work, we propose a simple and effective router microarchitecture for in-network allreduce, which uses an RDMA protocol to improve its throughput. We further discuss strategies to tackle the aforementioned challenges. As demonstrated through experiments, our approach not only shows advantages over the state-of-the-art in-network solutions, but also accelerates allreduce at a near-optimal level compared to host-based algorithms.
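The semantics being offloaded can be stated in a few lines; a host-side reference for allreduce with sum as the reduction operator (in-network designs like Roar perform the elementwise aggregation inside the routers rather than on the hosts):

```python
def allreduce(host_buffers):
    """Reference allreduce semantics: elementwise-sum the buffers of all
    hosts, then give every host a copy of the result."""
    total = [0] * len(host_buffers[0])
    for buf in host_buffers:                    # aggregate ...
        for i, v in enumerate(buf):
            total[i] += v
    return [list(total) for _ in host_buffers]  # ... then broadcast

out = allreduce([[1, 2], [3, 4], [5, 6]])
```

Floating-point reductions make the reproducibility challenge above concrete: summation order inside the network must be controlled for repeated runs to produce bitwise-identical results.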
Citations: 0
BiRFIA: Selective Binary Rewriting for Function Interception on ARM
Pub Date: 2023-06-21 DOI: 10.1145/3577193.3593701
Kelun Lei, Xin You, Hailong Yang, Zhongzhi Luan, D. Qian
Function interception of fully-optimized binaries is widely used for optimization with its ability to accurately collect runtime information and detect inefficiencies at the function level. However, the implementation of function interception with existing binary rewriting techniques still suffers from limited reliability and performance on the ARM platform. In this paper, we propose BiRFIA, an efficient selective binary rewriting framework for function interception targeting highly optimized binaries on ARM platforms. BiRFIA performs static binary rewriting of specific functions and intercepts them through well-formed trampoline sections and external instrumentation libraries. In addition, BiRFIA places complex instrumentation code in the trampoline section and jumps to it via an adaptive instruction eviction strategy, which significantly reduces the probability of unexpected errors. For evaluation, we develop two function interception tools based on BiRFIA, including a function performance event counter collector and a function parameter tracer. Guided by these tools, we optimize several benchmarks and real-world programs, yielding up to 8% performance speedup. Our evaluation result demonstrates that BiRFIA incurs a negligible runtime overhead of 1.006x on average.
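The interception pattern itself, independent of binary rewriting, can be illustrated with a high-level Python analogue: calls are diverted to a "trampoline" that records events and then falls through to the original function. This is only an analogy for the concept; BiRFIA achieves the diversion by rewriting ARM machine code, not by wrapping functions in a managed runtime:

```python
import functools
import time

def intercept(fn, counters):
    """Divert calls to a trampoline that records a call count and the
    accumulated wall time, then falls through to the original function."""
    name = fn.__name__
    @functools.wraps(fn)
    def trampoline(*args, **kwargs):
        counters[name] = counters.get(name, 0) + 1
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        counters[name + ".seconds"] = (
            counters.get(name + ".seconds", 0.0) + time.perf_counter() - t0)
        return result
    return trampoline

def square(x):
    return x * x

counts = {}
square = intercept(square, counts)  # calls now pass through the trampoline
vals = [square(i) for i in range(3)]
```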
Citations: 0
RT-kNNS Unbound: Using RT Cores to Accelerate Unrestricted Neighbor Search
Pub Date: 2023-05-26 DOI: 10.1145/3577193.3593738
Vani Nagarajan, D. Mandarapu, Milind Kulkarni
The problem of identifying the k-Nearest Neighbors (kNNS) of a point has proven to be very useful both as a standalone application and as a subroutine in larger applications. Given its far-reaching applicability in areas such as machine learning and point clouds, extensive research has gone into leveraging GPU acceleration to solve this problem. Recent work has shown that using Ray Tracing cores in recent GPUs to accelerate kNNS is much more efficient compared to traditional acceleration using shader cores. However, the existing translation of kNNS to a ray tracing problem imposes a constraint on the search space for neighbors. Due to this, we can only use RT cores to accelerate fixed-radius kNNS, which requires the user to set a search radius a priori and hence can miss neighbors. In this work, we propose TrueKNN, the first unbounded RT-accelerated neighbor search. TrueKNN adopts an iterative approach where we incrementally grow the search space until all points have found their k neighbors. We show that our approach is orders of magnitude faster than existing approaches and can even be used to accelerate fixed-radius neighbor searches.
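The fixed-radius limitation and the grow-until-found remedy can be sketched with a brute-force baseline; the doubling growth factor is illustrative, and TrueKNN's actual RT-core traversal is nothing like this NumPy code:

```python
import numpy as np

def knn_fixed_radius(points, q, k, radius):
    """Fixed-radius kNN (what prior RT-core methods support): only
    neighbors within `radius` are eligible, so the result can be short."""
    d = np.linalg.norm(points - q, axis=1)
    idx = np.argsort(d)
    return [i for i in idx[:k] if d[i] <= radius]

def knn_grow(points, q, k, radius=1.0):
    """TrueKNN-style idea: grow the search radius until k neighbors are
    found (assumes at least k points exist)."""
    while True:
        found = knn_fixed_radius(points, q, k, radius)
        if len(found) == k:
            return found
        radius *= 2.0

pts = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 0.0], [10.0, 0.0]])
q = np.array([0.0, 0.0])
short = knn_fixed_radius(pts, q, k=3, radius=1.0)  # misses the point at x=3
full = knn_grow(pts, q, k=3)                       # grows until 3 are found
```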
Citations: 0
FMI: Fast and Cheap Message Passing for Serverless Functions
Pub Date : 2023-05-15 DOI: 10.1145/3577193.3593718
Marcin Copik, Roman Böhringer, A. Calotoiu, T. Hoefler
Serverless functions provide elastic scaling and a fine-grained billing model, making Function-as-a-Service (FaaS) an attractive programming model. However, for distributed jobs that benefit from large-scale and dynamic parallelism, the lack of fast and cheap communication is a major limitation. Individual functions cannot communicate directly, group operations do not exist, and users resort to manual implementations of storage-based communication. This results in communication times multiple orders of magnitude slower than those found in HPC systems. We overcome this limitation and present the FaaS Message Interface (FMI). FMI is an easy-to-use, high-performance framework for general-purpose point-to-point and collective communication in FaaS applications. We support different communication channels and offer a model-driven channel selection according to performance and cost expectations. We model the interface after MPI and show that message passing can be integrated into serverless applications with minor changes, providing portable communication closer to that offered by high-performance systems. In our experiments, FMI can speed up communication for a distributed machine learning FaaS application by up to 162x, while simultaneously reducing cost by up to 397 times.
Citations: 4
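The model-driven channel selection the FMI abstract mentions can be sketched as a small cost/latency trade-off. Everything here is an illustrative assumption, not FMI's actual interface: the channel names, the linear latency model, and all numbers are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    startup_s: float       # per-message startup latency (seconds)
    bandwidth_bps: float   # sustained bandwidth (bytes/s)
    cost_per_msg: float    # monetary cost per message (illustrative)

def predicted_latency(ch, msg_bytes):
    """Simple linear model: latency = startup + size / bandwidth."""
    return ch.startup_s + msg_bytes / ch.bandwidth_bps

def select_channel(channels, msg_bytes, latency_budget_s):
    """Among channels predicted to meet the latency budget, pick the
    cheapest; if none qualifies, fall back to the fastest one."""
    feasible = [c for c in channels
                if predicted_latency(c, msg_bytes) <= latency_budget_s]
    if feasible:
        return min(feasible, key=lambda c: c.cost_per_msg)
    return min(channels, key=lambda c: predicted_latency(c, msg_bytes))
```

The point of such a model is that a storage-based channel may be acceptable for large, latency-tolerant transfers, while a direct channel wins when functions must synchronize quickly.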
Revisiting Temporal Blocking Stencil Optimizations
Pub Date : 2023-05-12 DOI: 10.1145/3577193.3593716
Lingqi Zhang, M. Wahib, Peng Chen, Jintao Meng, Xiao Wang, Toshio Endo, S. Matsuoka
Iterative stencils are used widely across the spectrum of High Performance Computing (HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given the prevalence of GPU-accelerated supercomputers. To improve the data locality, temporal blocking is an optimization that combines a batch of time steps to process them together. Under the observation that GPUs are evolving to resemble CPUs in some aspects, we revisit temporal blocking optimizations for GPUs. We explore how temporal blocking schemes can be adapted to the new features in the recent Nvidia GPUs, including large scratchpad memory, hardware prefetching, and device-wide synchronization. We propose a novel temporal blocking method, EBISU, which champions low device occupancy to drive aggressive deep temporal blocking on large tiles that are executed tile-by-tile. We compare EBISU with state-of-the-art temporal blocking libraries: STENCILGEN and AN5D. We also compare with state-of-the-art stencil auto-tuning tools that are equipped with temporal blocking optimizations: ARTEMIS and DRSTENCIL. Over a wide range of stencil benchmarks, EBISU achieves speedups up to 2.53x and a geometric mean speedup of 1.49x over the best state-of-the-art performance in each stencil benchmark.
Citations: 0
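The general idea of temporal blocking — batching several time steps per tile so data stays local — can be illustrated on a 1D three-point Jacobi stencil. This is a CPU sketch of the overlapped "ghost zone" variant only; EBISU's actual GPU scheme (scratchpad tiles, device-wide synchronization, tile-by-tile execution) is far more involved, and the tile/block sizes below are arbitrary.

```python
def jacobi_steps(u, steps):
    """Reference: advance the 1D three-point Jacobi stencil one step at
    a time, holding the two boundary values fixed (Dirichlet)."""
    u = list(u)
    for _ in range(steps):
        v = list(u)
        for i in range(1, len(u) - 1):
            v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = v
    return u

def jacobi_temporal_blocked(u, steps, tile, block):
    """Temporal blocking with overlapped tiles: each tile loads a halo
    of width b on both sides and redundantly recomputes it, so b time
    steps can be batched before tiles must exchange data."""
    u = list(u)
    n = len(u)
    done = 0
    while done < steps:
        b = min(block, steps - done)
        out = list(u)                        # boundary values carried over
        for start in range(1, n - 1, tile):
            end = min(start + tile, n - 1)   # tile owns [start, end)
            lo, hi = max(0, start - b), min(n, end + b)
            local = jacobi_steps(u[lo:hi], b)
            out[start:end] = local[start - lo:end - lo]
        u = out
        done += b
    return u
```

The halo width must match the batch depth b: after b steps, only points at least b cells away from a tile's artificial edges are correct, which is exactly the tile's owned interior here. The redundant halo work is the price paid for touching each tile's data once per batch instead of once per step.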
Journal: Proceedings of the 37th International Conference on Supercomputing