Harnessing the Crowd for Autotuning High-Performance Computing Applications
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00069
Younghyun Cho, J. Demmel, Jacob King, X. Li, Yang Liu, Hengrui Luo
This paper presents GPTuneCrowd, a crowd-based autotuning framework for tuning high-performance computing applications. GPTuneCrowd collects performance data from many users through a user-friendly tuner interface. It then applies novel autotuning techniques, based on transfer learning and parameter sensitivity analysis, to maximize tuning quality using the data collected from the crowd. This paper presents several real-world case studies of GPTuneCrowd. Our evaluation shows that GPTuneCrowd's transfer learning improves the tuned performance of ScaLAPACK's PDGEQRF by 1.57x and of the plasma fusion code NIMROD by 2.97x over a non-transfer-learning autotuner. We use GPTuneCrowd's sensitivity analysis to reduce the search spaces of SuperLU_DIST and Hypre. Tuning on the reduced search space achieves 1.17x and 1.35x better tuned performance for SuperLU_DIST and Hypre, respectively, compared to tuning on the original search space.
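As a concrete illustration of the transfer-learning idea, the sketch below warm-starts a Gaussian-process surrogate with crowd-contributed (configuration, runtime) samples before spending a small local tuning budget. The objective run_app, the two-parameter search space, and the acquisition rule are invented for the example; GPTuneCrowd's actual interface and models differ.

```python
# Minimal sketch: crowd data warm-starts a GP-based autotuner for a new task.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def run_app(x):
    # Placeholder for one timed run of the target application (hypothetical).
    return float((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)

# Pretend these (configuration, runtime) pairs were uploaded by other users.
X_hist = rng.uniform(0, 1, size=(64, 2))
y_hist = np.array([run_app(x) for x in X_hist]) + 0.01 * rng.standard_normal(64)

for _ in range(8):                                  # small local tuning budget
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(X_hist, y_hist)                          # crowd + local samples
    cand = rng.uniform(0, 1, size=(256, 2))         # random candidate configs
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmin(mu - 0.5 * sigma)]      # optimistic acquisition
    X_hist = np.vstack([X_hist, x_next])
    y_hist = np.append(y_hist, run_app(x_next))

best = np.argmin(y_hist)
print("best config:", X_hist[best], "runtime:", y_hist[best])
```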
{"title":"Harnessing the Crowd for Autotuning High-Performance Computing Applications","authors":"Younghyun Cho, J. Demmel, Jacob King, X. Li, Yang Liu, Hengrui Luo","doi":"10.1109/IPDPS54959.2023.00069","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00069","url":null,"abstract":"This paper presents GPTuneCrowd, a crowd-based autotuning framework for tuning high-performance computing applications. GPTuneCrowd collects performance data from various users using a user-friendly tuner interface. GPTuneCrowd then presents novel autotuning techniques, based on transfer learning and parameter sensitivity analysis, to maximize tuning quality using collected data from the crowd. This paper shows several real-world case studies of GPTuneCrowd. Our evaluation shows that GPTuneCrowd’s transfer learning improves the tuned performance of ScaLAPACK’s PDGEQRF by 1.57x and a plasma fusion code NIMROD by 2.97x, over a non-transfer learning autotuner. We use GPTuneCrowd’s sensitivity analysis to reduce the search space of SuperLU_DIST and Hypre. Tuning on the reduced search space achieves 1.17x and 1.35x better tuned performance of SuperLU_DIST and Hypre, respectively, compared to the original search space.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127218390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Satellite Collision Detection using Spatial Data Structures
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00078
C. Hellwig, Fabian Czappa, Martin Michel, R. Bertrand, F. Wolf
In recent years, the number of artificial objects in Earth orbit has increased rapidly due to lower launch costs and new applications for satellites. More and more governments and private companies are discovering space for their own purposes. Private companies are using space as a new business field, launching thousands of satellites into orbit to offer services such as worldwide Internet access. Consequently, the probability of collisions, and thus the degradation of the orbital environment, is rapidly increasing. Avoiding devastating collisions requires efficient algorithms that identify approaching satellites early. Traditional deterministic filter-based conjunction detection algorithms compare each satellite to every other satellite and pass the pairs through a chain of orbital filters. Unfortunately, this leads to a runtime complexity of O(n²). In this paper, we propose two alternative approaches that rely on spatial data structures and thus allow us to exploit modern hardware's parallelism efficiently. First, we introduce a purely grid-based variant that relies on non-blocking atomic hash maps to identify conjunctions. Second, we present a hybrid method that combines this approach with traditional filter chains. Both implementations make it possible to identify conjunctions in a population of millions of satellites with high precision in a comparatively short time. While the grid-based variant has lower memory consumption, the hybrid variant is faster if enough memory is available.
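The sketch below shows the core of the grid idea: hash each object's position into a uniform 3-D cell, then run exact distance checks only against the same and adjacent cells, avoiding the all-pairs O(n²) scan. The cell size, coordinates, and threshold are illustrative, and this serial dictionary stands in for the paper's non-blocking atomic hash maps.

```python
# Grid-based candidate search: only same-cell and neighbor-cell pairs are
# close enough to possibly violate the conjunction threshold.
from collections import defaultdict
from itertools import product
import numpy as np

THRESHOLD = 10.0        # conjunction distance (km), illustrative
CELL = THRESHOLD        # cell edge >= threshold keeps checks local

def find_conjunctions(positions):
    grid = defaultdict(list)
    for i, p in enumerate(positions):
        grid[tuple((p // CELL).astype(int))].append(i)
    hits = []
    for cell, members in grid.items():
        # Gather occupants of this cell and its 26 neighbors.
        near = []
        for off in product((-1, 0, 1), repeat=3):
            near.extend(grid.get(tuple(c + o for c, o in zip(cell, off)), []))
        for i in members:
            for j in near:
                if j > i and np.linalg.norm(positions[i] - positions[j]) < THRESHOLD:
                    hits.append((i, j))
    return hits

sats = np.random.default_rng(1).uniform(0, 200, size=(2000, 3))  # toy population
print(len(find_conjunctions(sats)), "candidate conjunctions")
```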
{"title":"Satellite Collision Detection using Spatial Data Structures","authors":"C. Hellwig, Fabian Czappa, Martin Michel, R. Bertrand, F. Wolf","doi":"10.1109/IPDPS54959.2023.00078","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00078","url":null,"abstract":"In recent years, the number of artificial objects in Earth orbit has increased rapidly due to lower launch costs and new applications for satellites. More and more governments and private companies are discovering space for their own purposes. Private companies are using space as a new business field, launching thousands of satellites into orbit to offer services like worldwide Internet access. Consequently, the probability of collisions and, thus, the degradation of the orbital environment is rapidly increasing. To avoid devastating collisions at an early stage, efficient algorithms are required to identify satellites approaching each other. Traditional deterministic filter-based conjunction detection algorithms compare each satellite to every other satellite and pass them through a chain of orbital filters. Unfortunately, this leads to a runtime complexity of O(n2). In this paper, we propose two alternative approaches that rely on spatial data structures and thus allow us to exploit modern hardware’s parallelism efficiently. Firstly, we introduce a purely grid-based variant that relies on non-blocking atomic hash maps to identify conjunctions. Secondly, we present a hybrid method that combines this approach with traditional filter chains. Both implementations make it possible to identify conjunctions in a large population with millions of satellites with high precision in a comparatively short time. While the grid-based variant is characterized by lower memory consumption, the hybrid variant is faster if enough memory is available.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129819452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Cloud Computing Resource Usage for Hemodynamic Simulation
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00063
William Ladd, Christopher W Jensen, M. Vardhan, Jeff Ames, J. Hammond, E. Draeger, A. Randles
Cloud computing resources are becoming an increasingly attractive option for simulation workflows but require users to assess a wider variety of hardware options and associated costs than required by traditional in-house hardware or fixed allocations at leadership computing facilities. The pay-as-you-go model used by cloud providers gives users the opportunity to make more nuanced cost-benefit decisions at runtime by choosing hardware that best matches a given workload, but creates the risk of suboptimal allocation strategies or inadvertent cost overruns. In this work, we propose the use of an iteratively-refined performance model to optimize cloud simulation campaigns against overall cost, throughput, or maximum time to solution. Hemodynamic simulations represent an excellent use case for these assessments, as the relative costs and dominant terms in the performance model can vary widely with hardware, numerical parameters and physics models. Performance and scaling behavior of hemodynamic simulations on multiple cloud services as well as a traditional compute cluster are collected and evaluated, and an initial performance model is proposed along with a strategy for dynamically refining it with additional experimental data.
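As a worked toy version of this trade-off, the sketch below fits a simple scaling model T(n) = a + b/n + c·n (serial, parallel, and communication terms) to a few timed runs, then picks the node count that minimizes dollar cost or time-to-solution. The measurements, coefficients, and hourly price are hypothetical; the paper's model terms for hemodynamic codes are richer and refined iteratively with new data.

```python
# Fit a toy performance model and optimize node count for cost or time.
import numpy as np

nodes = np.array([1, 2, 4, 8, 16])
times = np.array([100.0, 52.0, 28.0, 16.5, 11.8])  # measured hours (made up)

# Least-squares fit of T(n) = a + b/n + c*n.
A = np.column_stack([np.ones_like(nodes, float), 1.0 / nodes, nodes.astype(float)])
a, b, c = np.linalg.lstsq(A, times, rcond=None)[0]

PRICE = 3.0                                        # $/node-hour, illustrative
cand = np.arange(1, 65)
T = a + b / cand + c * cand
cost = PRICE * cand * T
print("min cost: n =", cand[np.argmin(cost)], f"(${cost.min():.0f})")
print("min time: n =", cand[np.argmin(T)], f"({T.min():.1f} h)")
```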
{"title":"Optimizing Cloud Computing Resource Usage for Hemodynamic Simulation","authors":"William Ladd, Christopher W Jensen, M. Vardhan, Jeff Ames, J. Hammond, E. Draeger, A. Randles","doi":"10.1109/IPDPS54959.2023.00063","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00063","url":null,"abstract":"Cloud computing resources are becoming an increasingly attractive option for simulation workflows but require users to assess a wider variety of hardware options and associated costs than required by traditional in-house hardware or fixed allocations at leadership computing facilities. The pay-as-you-go model used by cloud providers gives users the opportunity to make more nuanced cost-benefit decisions at runtime by choosing hardware that best matches a given workload, but creates the risk of suboptimal allocation strategies or inadvertent cost overruns. In this work, we propose the use of an iteratively-refined performance model to optimize cloud simulation campaigns against overall cost, throughput, or maximum time to solution. Hemodynamic simulations represent an excellent use case for these assessments, as the relative costs and dominant terms in the performance model can vary widely with hardware, numerical parameters and physics models. Performance and scaling behavior of hemodynamic simulations on multiple cloud services as well as a traditional compute cluster are collected and evaluated, and an initial performance model is proposed along with a strategy for dynamically refining it with additional experimental data.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130813785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Porting a Computational Fluid Dynamics Code with AMR to Large-scale GPU Platforms
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00066
J. H. Davis, Justin Shafner, Daniel Nichols, N. Grube, P. Martin, A. Bhatele
Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, with applications in atmospheric flight, supersonic combustion, materials discovery, and climate prediction. In this paper, we describe our experience modernizing and extending the capabilities of CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using AMReX, a highly scalable AMR library, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo v2.0, on the Summit system, demonstrating a 6× to 44× speedup over the CPU-only version.
{"title":"Porting a Computational Fluid Dynamics Code with AMR to Large-scale GPU Platforms","authors":"J. H. Davis, Justin Shafner, Daniel Nichols, N. Grube, P. Martin, A. Bhatele","doi":"10.1109/IPDPS54959.2023.00066","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00066","url":null,"abstract":"Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this paper, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo v2.0, on the Summit system, demonstrating a 6× to 44× speedup over the CPU-only version.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130949397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FedBIAD: Communication-Efficient and Accuracy-Guaranteed Federated Learning with Bayesian Inference-Based Adaptive Dropout
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00056
Jingjing Xue, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang
Federated Learning (FL) emerges as a distributed machine learning paradigm without end-user data transmission, effectively avoiding privacy leakage. Participating devices in FL are usually bandwidth-constrained, and the uplink is much slower than the downlink in wireless networks, which causes a severe uplink communication bottleneck. A prominent direction to alleviate this problem is federated dropout, which drops fractional weights of local models. However, existing federated dropout studies focus on random or ordered dropout and lack theoretical support, resulting in unguaranteed performance. In this paper, we propose Federated learning with Bayesian Inference-based Adaptive Dropout (FedBIAD), which regards weight rows of local models as probability distributions and adaptively drops partial weight rows based on importance indicators correlated with the trend of local training loss. By applying FedBIAD, each client adaptively selects a high-quality dropping pattern with accurate approximations and only transmits parameters of non-dropped weight rows to mitigate uplink costs while improving accuracy. Theoretical analysis demonstrates that the convergence rate of the average generalization error of FedBIAD is minimax optimal up to a squared logarithmic factor. Extensive experiments on image classification and next-word prediction show that compared with status quo approaches, FedBIAD provides 2× uplink reduction with an accuracy increase of up to 2.41% even on non-Independent and Identically Distributed (non-IID) data, which brings up to 72% decrease in training time.
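A heavily simplified sketch of the row-dropout mechanics follows: score each weight row by an importance indicator, keep the top rows, and upload only those. The per-row update magnitude used here is a stand-in indicator; FedBIAD's actual indicator is Bayesian and tied to the local training-loss trend, and its dropping ratio is adaptive rather than fixed.

```python
# Upload only the most important weight rows to compress the uplink.
import numpy as np

def select_rows(w_before, w_after, keep_ratio=0.5):
    delta = np.linalg.norm(w_after - w_before, axis=1)  # per-row update size
    k = max(1, int(keep_ratio * w_after.shape[0]))
    keep = np.argsort(delta)[-k:]                       # most-changed rows
    return keep, w_after[keep]

rng = np.random.default_rng(0)
w0 = rng.standard_normal((256, 64))                     # model before local training
w1 = w0 + 0.1 * rng.standard_normal((256, 64)) * (rng.uniform(size=(256, 1)) < 0.3)

rows, payload = select_rows(w0, w1)
print(f"uploading {payload.nbytes} of {w1.nbytes} bytes ({len(rows)} rows)")

# Server side: overwrite only the transmitted rows in its copy of the model.
w_server = w0.copy()
w_server[rows] = payload
```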
{"title":"FedBIAD: Communication-Efficient and Accuracy-Guaranteed Federated Learning with Bayesian Inference-Based Adaptive Dropout","authors":"Jingjing Xue, Min Liu, Sheng Sun, Yuwei Wang, Hui Jiang, Xue Jiang","doi":"10.1109/IPDPS54959.2023.00056","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00056","url":null,"abstract":"Federated Learning (FL) emerges as a distributed machine learning paradigm without end-user data transmission, effectively avoiding privacy leakage. Participating devices in FL are usually bandwidth-constrained, and the uplink is much slower than the downlink in wireless networks, which causes a severe uplink communication bottleneck. A prominent direction to alleviate this problem is federated dropout, which drops fractional weights of local models. However, existing federated dropout studies focus on random or ordered dropout and lack theoretical support, resulting in unguaranteed performance. In this paper, we propose Federated learning with Bayesian Inference-based Adaptive Dropout (FedBIAD), which regards weight rows of local models as probability distributions and adaptively drops partial weight rows based on importance indicators correlated with the trend of local training loss. By applying FedBIAD, each client adaptively selects a high-quality dropping pattern with accurate approximations and only transmits parameters of non-dropped weight rows to mitigate uplink costs while improving accuracy. Theoretical analysis demonstrates that the convergence rate of the average generalization error of FedBIAD is minimax optimal up to a squared logarithmic factor. Extensive experiments on image classification and next-word prediction show that compared with status quo approaches, FedBIAD provides 2× uplink reduction with an accuracy increase of up to 2.41% even on non-Independent and Identically Distributed (non-IID) data, which brings up to 72% decrease in training time.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129357731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FIRST: Exploiting the Multi-Dimensional Attributes of Functions for Power-Aware Serverless Computing
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00091
Lu Zhang, C. Li, Xinkai Wang, Weiqi Feng, Zheng Yu, Quan Chen, Jingwen Leng, Minyi Guo, Pu Yang, Shang Yue
Emerging cloud-native development models raise new challenges for managing server performance and power at microsecond scale. Compared with traditional cloud workloads, serverless functions exhibit unprecedented heterogeneity, variability, and dynamicity. Designing cloud-native power management schemes for serverless functions requires significant engineering effort. Current solutions remain sub-optimal since their orchestration process is often one-sided and lacks a systematic view. A key obstacle to truly efficient function deployment is the wide abstraction gap between upper-layer request scheduling and low-level hardware execution. In this work, we show that the optimal operating point (OOP) for energy efficiency cannot be attained without synthesizing the multi-dimensional attributes of functions. We present FIRST, a novel mechanism that enables servers to better orchestrate serverless functions. The key feature of FIRST is a lightweight Internal Representation and meta-Scheduling (IRS) layer that extracts the maximum potential revenue from the servers. Specifically, FIRST follows a pipeline-style workflow. Its frontend components analyze functions from different angles and expose their key features to the system, while its backend components make informed function-assignment decisions to avoid diverging from the OOP. We further demonstrate how to build extensions on FIRST to enable versatile cloud-native power management. Overall, our design constitutes a flexible management layer that supports power-aware function deployment. We show that FIRST allows 94% of functions to be processed under the OOP, which brings up to 24% energy efficiency improvements.
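To make "operating point" concrete, here is a deliberately reduced illustration: given measured throughput and power for each function at a few DVFS states, pick the state that maximizes work per joule. The table of measurements is invented, and FIRST's IRS layer derives such decisions from far richer multi-dimensional attributes than this single metric.

```python
# Toy per-function operating-point selection by energy efficiency.
profiles = {
    # function: [(freq_GHz, reqs_per_s, watts), ...]  (made-up measurements)
    "thumbnail": [(1.2, 90, 35), (2.0, 150, 70), (3.0, 170, 120)],
    "ml_infer":  [(1.2, 20, 40), (2.0, 45, 80), (3.0, 70, 130)],
}

for fn, states in profiles.items():
    best = max(states, key=lambda s: s[1] / s[2])   # requests per joule
    print(f"{fn}: run at {best[0]} GHz ({best[1] / best[2]:.2f} req/J)")
```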
{"title":"FIRST: Exploiting the Multi-Dimensional Attributes of Functions for Power-Aware Serverless Computing","authors":"Lu Zhang, C. Li, Xinkai Wang, Weiqi Feng, Zheng Yu, Quan Chen, Jingwen Leng, Minyi Guo, Pu Yang, Shang Yue","doi":"10.1109/IPDPS54959.2023.00091","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00091","url":null,"abstract":"Emerging cloud-native development models raise new challenges for managing server performance and power at microsecond scale. Compared with traditional cloud workloads, serverless functions exhibit unprecedented heterogeneity, variability, and dynamicity. Designing cloud-native power management schemes for serverless functions requires significant engineering effort. Current solutions remain sub-optimal since their orchestration process is often one-sided, lacking a systematic view. A key obstacle to truly efficient function deployment is the fundamental wide abstraction gap between the upper-layer request scheduling and the low-level hardware execution.In this work, we show that the optimal operating point (OOP) for energy efficiency cannot be attained without synthesizing the multi-dimensional attributes of functions. We present FIRST, a novel mechanism that enables servers to better orchestrate serverless functions. The key feature of FIRST is that it leverages a lightweight Internal Representation and meta-Scheduling (IRS) layer for collecting the maximum potential revenue from the servers. Specifically, FIRST follows a pipeline-style workflow. Its frontend components aim to analyze functions from different angles and expose their key features to the system. Meanwhile, its backend components are able to make informed function assignment decisions to avoid OOP divergence. We further demonstrate the way to create extensions based on FIRST to enable versatile cloud-native power management. In total, our design constitutes a flexible management layer that supports power-aware function deployment. We show that FIRST could allow 94% functions to be processed under the OOP, which brings up to 24% energy efficiency improvements.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125484185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SelB-k-NN: A Mini-Batch K-Nearest Neighbors Algorithm on AI Processors
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00088
Yifeng Tang, Cho-Li Wang
The popularity of Artificial Intelligence (AI) motivates novel domain-specific hardware known as AI processors. As a design trade-off, AI processors offer tremendous computation power for matrix multiplications and activations, while leaving other operations less powerful, e.g., scalar operations and vectorized comparisons and selections. For the k-nearest neighbors (k-NN) algorithm, which consists of a distance computation phase and a k-selection phase, the former is naturally accelerated, but previously efficient k-selection approaches become problematic. Moreover, limited memory forces k-NN to adopt a mini-batch manner with a tiling technique. As the distance computation's results are the k-selection's inputs, the former's tiling shape determines the latter's. Since the two phases execute on separate hardware units requiring different performance analyses, it is doubtful whether the former's tiling strategies benefit the latter and the entire k-NN. To address the new challenges brought by AI processors, this paper proposes SelB-k-NN (Selection-Bitonic-k-NN), a mini-batch algorithm inspired by selection sort and bitonic k-selection. SelB-k-NN avoids expanding the weakly-supported operations to huge datasets. To apply SelB-k-NN to various AI processors, we propose two algorithms that reduce the hardware support requirements. Since matrix multiplication operates on data through a specifically-designed memory hierarchy that k-selection does not share, the tiling shape of the former cannot guarantee the best execution of the latter, and vice versa. By quantifying the runtime workload variations of k-selection, we formulate an optimization problem that searches for the optimal tiling shapes of both phases, with an offline pruning method that reduces the search space in the preprocessing stage. Evaluations show that on the Huawei Ascend 310 AI processor, SelB-k-NN achieves a 2.01× speedup over bitonic k-selection, 23.93× over the heap approach, and 78.52× over the CPU approach. For mini-batch SelB-k-NN, the optimal tiling shapes for the two phases achieve 1.48× acceleration compared with the matrix-multiplication tiling shapes and 1.14× compared with the k-selection tiling shapes, with 72.80% of the search space pruned.
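The sketch below shows the two-phase, mini-batch shape the abstract describes: a tiled distance phase that is pure matrix arithmetic (the part AI processors accelerate well) and a k-selection phase (the part that is weak on such hardware, done here with argpartition). The tile size is illustrative; SelB-k-NN's contribution is precisely in replacing generic selection and searching for the optimal tiling shapes per phase.

```python
# Mini-batch k-NN: matmul-dominated distances, then per-row k-selection.
import numpy as np

def knn(queries, points, k, tile=1024):
    pts_sq = (points ** 2).sum(axis=1)               # ||y||^2, computed once
    idx = np.empty((len(queries), k), dtype=np.int64)
    for s in range(0, len(queries), tile):           # mini-batch over queries
        q = queries[s:s + tile]
        # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2 ; dominated by the matmul
        d = (q ** 2).sum(axis=1)[:, None] - 2.0 * q @ points.T + pts_sq
        idx[s:s + tile] = np.argpartition(d, k - 1, axis=1)[:, :k]
    return idx

rng = np.random.default_rng(0)
P = rng.standard_normal((20_000, 32)).astype(np.float32)
Q = rng.standard_normal((4_096, 32)).astype(np.float32)
print(knn(Q, P, k=10).shape)                         # (4096, 10)
```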
{"title":"SelB-k-NN: A Mini-Batch K-Nearest Neighbors Algorithm on AI Processors","authors":"Yifeng Tang, Cho-Li Wang","doi":"10.1109/IPDPS54959.2023.00088","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00088","url":null,"abstract":"The popularity of Artificial Intelligence (AI) motivates novel domain-specific hardware named AI processors. With a design trade-off, the AI processors feature incredible computation power for matrix multiplications and activations, while some leave other operations less powerful, e.g., scalar operations and vectorized comparisons & selections. For k-nearest neighbors (k-NN) algorithm, consisting of distance computation phase and k-selection phase, while the former is naturally accelerated, the previous efficient k-selection becomes problematic. Moreover, limited memory forces k-NN to adopt a mini-batch manner with tiling technique. As the distance computation’s results are the k-selection’s inputs, the former’s tiling shape determines that of the latter. Since the two phases execute on separate hardware units requiring different performance analyses, whether the former’s tiling strategies benefit the latter and entire k-NN is doubtful.To address the new challenges brought by the AI processors, this paper proposes SelB-k-NN (Selection-Bitonic-k-NN), a mini-batch algorithm inspired by selection sort and bitonic k-selection. SelB-k-NN avoids the expansion of the weakly-supported operations on the huge scale of datasets. To apply SelB-k-NN to various AI processors, we propose two algorithms to reduce the hardware support requirements. Since the matrix multiplication operates data with the specifically-designed memory hierarchy which k-selection does not share, the tiling shape of the former cannot guarantee the best execution of the latter and vice versa. By quantifying the runtime workload variations of k-selection, we formulate an optimization problem to search for the optimal tiling shapes of both phases with an offline pruning method, which reduces the search space in the preprocessing stage. Evaluations show that on Huawei Ascend 310 AI processor, SelB-k-NN achieves 2.01× speedup of the bitonic k-selection, 23.93× of the heap approach, 78.52× of the CPU approach. For mini-batch SelB-k-NN, the optimal tiling shapes for two phases respectively achieve 1.48× acceleration compared with the matrix multiplication tiling shapes and 1.14× with the k-selection tiling shapes, with 72.80% of the search space pruned.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117286368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software-Defined, Fast and Strongly-Consistent Data Replication for RDMA-Based PM Datastores
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00019
Haodi Lu, Haikun Liu, Chencheng Ye, Xiaofei Liao, Fubing Mao, Yu Zhang, Hai Jin
Modern storage systems typically replicate data on multiple servers to provide high reliability and availability. However, most commercially-deployed datastores often fail to offer low latency, high throughput, and strong consistency at the same time. This paper presents Whale, a Remote Direct Memory Access (RDMA) based primary-backup replication system for in-memory datastores. Whale achieves both low latency and strong consistency by decoupling metadata multicasting from data replication for all backup nodes, and using an optimistic commitment mechanism to respond to client write requests earlier. Whale achieves high throughput by propagating writes from the primary node to backup nodes asynchronously via RDMA-optimized chain replication. To further reduce the cost of data replication, we design a log-structured datastore to fully exploit the advantages of one-sided RDMA and Persistent Memory (PM). We implement Whale on a cluster equipped with PM and InfiniBand RDMA networks. Experimental results show that Whale achieves much higher throughput and lower latency than state-of-the-art replication protocols.
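The following is a schematic sketch (plain Python threads, no RDMA or PM) of the two decoupled paths the abstract describes: small metadata is multicast to every backup on the latency-critical path, the bulk data flows down a chain asynchronously, and the client is answered optimistically once the metadata is acknowledged. Whale's real protocol, failure handling, and persistent-memory log layout are far more involved than this toy.

```python
# Toy decoupling of metadata multicast from asynchronous chain replication.
import queue
import threading

class Backup:
    def __init__(self):
        self.meta, self.log = {}, {}
        self.inbox = queue.Queue()

    def ack_meta(self, key, version):
        self.meta[key] = version          # tiny, latency-critical path
        return True

    def run(self):                        # asynchronous bulk-data path
        while True:
            key, version, value, rest = self.inbox.get()
            self.log[(key, version)] = value
            if rest:                      # forward down the chain
                rest[0].inbox.put((key, version, value, rest[1:]))

def write(primary_log, backups, key, version, value):
    if all(b.ack_meta(key, version) for b in backups):  # metadata multicast
        primary_log[(key, version)] = value
        backups[0].inbox.put((key, version, value, backups[1:]))
        return "committed (optimistically)"             # early client response

backups = [Backup() for _ in range(3)]
for b in backups:
    threading.Thread(target=b.run, daemon=True).start()
print(write({}, backups, "k1", 1, b"payload"))
```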
{"title":"Software-Defined, Fast and Strongly-Consistent Data Replication for RDMA-Based PM Datastores","authors":"Haodi Lu, Haikun Liu, Chencheng Ye, Xiaofei Liao, Fubing Mao, Yu Zhang, Hai Jin","doi":"10.1109/IPDPS54959.2023.00019","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00019","url":null,"abstract":"Modern storage systems typically replicate data on multiple servers to provide high reliability and availability. However, most commercially-deployed datastores often fail to offer low latency, high throughput, and strong consistency at the same time. This paper presents Whale, a Remote Direct Memory Access (RDMA) based primary-backup replication system for in-memory datastores. Whale achieves both low latency and strong consistency by decoupling metadata multicasting from data replication for all backup nodes, and using an optimistic commitment mechanism to respond to client write requests earlier. Whale achieves high throughput by propagating writes from the primary node to backup nodes asynchronously via RDMA-optimized chain replication. To further reduce the cost of data replication, we design a log-structured datastore to fully exploit the advantages of one-sided RDMA and Persistent Memory (PM). We implement Whale on a cluster equipped with PM and InfiniBand RDMA networks. Experimental results show that Whale achieves much higher throughput and lower latency than state-of-the-art replication protocols.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121540927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Message from the IPDPS 2023 Program Chairs
Pub Date: 2023-05-01 | DOI: 10.1109/ipdps54959.2023.00006
{"title":"Message from the IPDPS 2023 Program Chairs","authors":"","doi":"10.1109/ipdps54959.2023.00006","DOIUrl":"https://doi.org/10.1109/ipdps54959.2023.00006","url":null,"abstract":"","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127699535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Guaranteed Approximation Algorithm for Scheduling Fork-Joins with Communication Delay
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00087
P. Dutot, Yeu-Shin Fu, Nikhil Prasad, O. Sinnen
Scheduling task graphs with communication delays is a widely studied NP-hard problem. Many heuristics have been proposed, but no constant-factor approximation algorithm is known for this classic model. In this paper, we focus on scheduling the important class of fork-join task graphs (which describe many common computations) on homogeneous processors. For this sub-case, we propose a guaranteed algorithm with a $\left(1 + \frac{m}{m-1}\right)$-approximation factor, where m is the number of processors. The algorithm is not only the first constant approximation for an important sub-domain of the classic scheduling problem, it is also a practical algorithm that can obtain shorter makespans than known heuristics. To demonstrate this, we propose adaptations of known scheduling heuristics to the specific fork-join structure. In an extensive evaluation, we implemented these algorithms and scheduled many fork-join graphs with up to thousands of tasks and various computation-time distributions on up to hundreds of processors. The results demonstrate the competitive nature of the proposed approximation algorithm.
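To make the guarantee concrete, the bound can be instantiated for a few processor counts, writing $S$ for the produced makespan and $S_{\mathrm{opt}}$ for the optimum; the factor decreases toward 2 as $m$ grows:

```latex
% The guarantee S \le \left(1 + \frac{m}{m-1}\right) S_{\mathrm{opt}},
% evaluated at sample processor counts m:
\[
  m = 2:\; S \le 3\,S_{\mathrm{opt}}, \qquad
  m = 8:\; S \le \tfrac{15}{7}\,S_{\mathrm{opt}} \approx 2.14\,S_{\mathrm{opt}}, \qquad
  m \to \infty:\; S \le 2\,S_{\mathrm{opt}}.
\]
```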
{"title":"A Guaranteed Approximation Algorithm for Scheduling Fork-Joins with Communication Delay","authors":"P. Dutot, Yeu-Shin Fu, Nikhil Prasad, O. Sinnen","doi":"10.1109/IPDPS54959.2023.00087","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00087","url":null,"abstract":"Scheduling task graphs with communication delay is a widely studied NP-hard problem. Many heuristics have been proposed, but there is no constant approximation algorithm for this classic model. In this paper, we focus on the scheduling of the important class of fork-join task graphs (describing many types of common computations) on homogeneous processors. For this sub-case, we propose a guaranteed algorithm with a $left( {1 + frac{m}{{m - 1}}} right)$-approximation factor, where m is the number of processors. The algorithm is not only the first constant approximation for an important sub-domain of the classic scheduling problem, it is also a practical algorithm that can obtain shorter makespans than known heuristics. To demonstrate this, we propose adaptations of known scheduling heuristic for the specific fork-join structure. In an extensive evaluation, we then implemented these algorithms and scheduled many fork-join graphs with up to thousands of tasks and various computation time distributions on up to hundreds of processors. Comparing the obtained results demonstrates the competitive nature of the proposed approximation algorithm.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116472569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}