
Latest publications: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Smart Redbelly Blockchain: Reducing Congestion for Web3
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00098
Deepal Tennakoon, Yiding Hua, V. Gramoli
Decentralization promises to remedy the drawbacks of the web by executing decentralized applications (DApps) on blockchains. Unfortunately, modern blockchains cannot support realistic web application workloads, mainly due to congestion. We introduce the Smart Redbelly Blockchain (SRBB), a provably correct permissionless blockchain that reduces congestion by (1) avoiding redundant propagation and validation of transactions with Transaction Validation and Propagation Reduction (TVPR) and (2) mitigating the propagation of invalid transactions within blocks by Byzantine nodes with a dedicated Reward-Penalty Mechanism (RPM). Our comparison of SRBB against Algorand, Avalanche, Diem, Ethereum, Quorum, and Solana, using the DIABLO benchmark suite, indicates that SRBB outperforms all these blockchains under real application workloads. Moreover, SRBB is the only blockchain to successfully execute real workloads of NASDAQ and Uber on a DApp without losing transactions. To demonstrate that TVPR and RPM are the causes of the improved performance, we compare SRBB with its naive baseline, which does not contain TVPR and RPM. Our results show that TVPR increases throughput by 55× and divides latency by 3.5, while RPM increases throughput by 7% under flooding attacks. Finally, TVPR helps reduce transaction losses in the normal scenario, while RPM goes further and mitigates transaction losses under flooding attacks.
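The reward-penalty idea can be illustrated with a toy ledger (a hypothetical sketch, not SRBB's actual RPM: the class, parameter values, and node names are invented, and a real mechanism would tie penalties to consensus and stake):

```python
from collections import defaultdict

class RewardPenaltyLedger:
    """Toy reward-penalty accounting: a proposer earns credit for each valid
    transaction in its block and loses credit for each invalid one, making it
    costly for Byzantine nodes to flood blocks with invalid transactions."""

    def __init__(self, reward=1, penalty=5):
        self.reward = reward
        self.penalty = penalty
        self.score = defaultdict(int)

    def settle_block(self, proposer, txs, is_valid):
        """Credit or debit a proposer after its block has been validated."""
        for tx in txs:
            if is_valid(tx):
                self.score[proposer] += self.reward
            else:
                self.score[proposer] -= self.penalty

ledger = RewardPenaltyLedger()
ledger.settle_block("node-a", ["tx1", "tx2", "bad-tx"],
                    lambda t: not t.startswith("bad"))
print(ledger.score["node-a"])  # 2 rewards - 1 penalty = -3
```

Because the penalty outweighs the reward, a node that mixes even a few invalid transactions into its blocks quickly ends up with a negative score.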
Citations: 4
SelB-k-NN: A Mini-Batch K-Nearest Neighbors Algorithm on AI Processors
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00088
Yifeng Tang, Cho-Li Wang
The popularity of Artificial Intelligence (AI) motivates novel domain-specific hardware named AI processors. As a design trade-off, AI processors feature incredible computation power for matrix multiplications and activations, while leaving other operations, e.g., scalar operations and vectorized comparisons and selections, less powerful. For the k-nearest neighbors (k-NN) algorithm, which consists of a distance computation phase and a k-selection phase, the former is naturally accelerated, but previously efficient k-selection becomes problematic. Moreover, limited memory forces k-NN to adopt a mini-batch manner with a tiling technique. As the distance computation’s results are the k-selection’s inputs, the former’s tiling shape determines that of the latter. Since the two phases execute on separate hardware units requiring different performance analyses, whether the former’s tiling strategies benefit the latter and the entire k-NN is doubtful. To address the new challenges brought by AI processors, this paper proposes SelB-k-NN (Selection-Bitonic-k-NN), a mini-batch algorithm inspired by selection sort and bitonic k-selection. SelB-k-NN avoids expanding the weakly-supported operations over huge datasets. To apply SelB-k-NN to various AI processors, we propose two algorithms to reduce the hardware support requirements. Since matrix multiplication operates on data with a specifically-designed memory hierarchy which k-selection does not share, the tiling shape of the former cannot guarantee the best execution of the latter and vice versa. By quantifying the runtime workload variations of k-selection, we formulate an optimization problem to search for the optimal tiling shapes of both phases with an offline pruning method, which reduces the search space in the preprocessing stage. Evaluations show that on the Huawei Ascend 310 AI processor, SelB-k-NN achieves a 2.01× speedup over bitonic k-selection, 23.93× over the heap approach, and 78.52× over the CPU approach. For mini-batch SelB-k-NN, the optimal tiling shapes for the two phases achieve 1.48× acceleration compared with the matrix multiplication tiling shapes and 1.14× compared with the k-selection tiling shapes, with 72.80% of the search space pruned.
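The mini-batch tiling structure described above can be sketched in NumPy (an illustrative sketch only: it uses a plain sort in place of bitonic k-selection, and a single tile size stands in for the paper's tuned per-phase tiling shapes):

```python
import numpy as np

def tiled_knn(queries, database, k, tile=1024):
    """Mini-batch k-NN: compute distances one database tile at a time so
    memory stays bounded, merging each tile's candidates with the running
    top-k. The matrix multiply inside the distance step is the part that
    AI processors accelerate well."""
    n_q = queries.shape[0]
    best_d = np.full((n_q, k), np.inf)
    best_i = np.full((n_q, k), -1, dtype=np.int64)
    for start in range(0, database.shape[0], tile):
        chunk = database[start:start + tile]
        # squared Euclidean distances between all queries and this tile
        d = ((queries ** 2).sum(1, keepdims=True)
             - 2 * queries @ chunk.T
             + (chunk ** 2).sum(1))
        idx = np.arange(start, start + chunk.shape[0])[None, :].repeat(n_q, 0)
        # merge new candidates with the running top-k (k-selection phase)
        cand_d = np.concatenate([best_d, d], axis=1)
        cand_i = np.concatenate([best_i, idx], axis=1)
        order = np.argsort(cand_d, axis=1)[:, :k]
        best_d = np.take_along_axis(cand_d, order, axis=1)
        best_i = np.take_along_axis(cand_i, order, axis=1)
    return best_d, best_i
```

The tile size here controls the same trade-off the paper optimizes: larger tiles favor the distance phase's matrix engine, while the merge cost of the selection phase grows with each tile's width.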
Citations: 0
Message from the IPDPS 2023 Program Chairs
Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00006
Citations: 0
Harnessing the Crowd for Autotuning High-Performance Computing Applications
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00069
Younghyun Cho, J. Demmel, Jacob King, X. Li, Yang Liu, Hengrui Luo
This paper presents GPTuneCrowd, a crowd-based autotuning framework for tuning high-performance computing applications. GPTuneCrowd collects performance data from various users using a user-friendly tuner interface. GPTuneCrowd then presents novel autotuning techniques, based on transfer learning and parameter sensitivity analysis, to maximize tuning quality using collected data from the crowd. This paper shows several real-world case studies of GPTuneCrowd. Our evaluation shows that GPTuneCrowd’s transfer learning improves the tuned performance of ScaLAPACK’s PDGEQRF by 1.57x and a plasma fusion code NIMROD by 2.97x, over a non-transfer learning autotuner. We use GPTuneCrowd’s sensitivity analysis to reduce the search space of SuperLU_DIST and Hypre. Tuning on the reduced search space achieves 1.17x and 1.35x better tuned performance of SuperLU_DIST and Hypre, respectively, compared to the original search space.
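Sensitivity-based search-space reduction can be illustrated roughly as follows (a hypothetical sketch, not GPTuneCrowd's actual analysis: it scores each tuning parameter by its simple correlation with runtime across crowd-collected samples and keeps only the most sensitive ones):

```python
import numpy as np

def prune_search_space(samples, runtimes, keep_fraction=0.5):
    """Toy sensitivity analysis: score each tuning parameter by how strongly
    it correlates with measured runtime, then return the indices of the
    parameters worth tuning; the rest could be fixed at their best observed
    values, shrinking the autotuner's search space."""
    samples = np.asarray(samples, dtype=float)   # shape: (n_runs, n_params)
    runtimes = np.asarray(runtimes, dtype=float)
    scores = np.abs([np.corrcoef(samples[:, j], runtimes)[0, 1]
                     for j in range(samples.shape[1])])
    n_keep = max(1, int(keep_fraction * samples.shape[1]))
    return np.argsort(-scores)[:n_keep]
```

A production sensitivity analysis would use a surrogate model rather than raw correlations, but the effect is the same: tuning proceeds over a fraction of the original parameter space.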
Citations: 3
Software-Defined, Fast and Strongly-Consistent Data Replication for RDMA-Based PM Datastores
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00019
Haodi Lu, Haikun Liu, Chencheng Ye, Xiaofei Liao, Fubing Mao, Yu Zhang, Hai Jin
Modern storage systems typically replicate data on multiple servers to provide high reliability and availability. However, most commercially-deployed datastores often fail to offer low latency, high throughput, and strong consistency at the same time. This paper presents Whale, a Remote Direct Memory Access (RDMA) based primary-backup replication system for in-memory datastores. Whale achieves both low latency and strong consistency by decoupling metadata multicasting from data replication for all backup nodes, and using an optimistic commitment mechanism to respond to client write requests earlier. Whale achieves high throughput by propagating writes from the primary node to backup nodes asynchronously via RDMA-optimized chain replication. To further reduce the cost of data replication, we design a log-structured datastore to fully exploit the advantages of one-sided RDMA and Persistent Memory (PM). We implement Whale on a cluster equipped with PM and InfiniBand RDMA networks. Experimental results show that Whale achieves much higher throughput and lower latency than state-of-the-art replication protocols.
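Chain replication itself can be sketched minimally (illustrative only: Whale propagates writes asynchronously over RDMA with an optimistic commitment mechanism, whereas this toy version propagates synchronously in-process, and all node names are invented):

```python
class ChainNode:
    """Minimal chain-replication sketch: writes enter at the head and flow
    node by node to the tail; a write is durable once the tail holds it,
    and reads served by the tail are strongly consistent."""

    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}

    def write(self, key, value):
        self.store[key] = value
        if self.successor is not None:   # propagate down the chain
            self.successor.write(key, value)

    def read(self, key):
        return self.store.get(key)

tail = ChainNode("backup-2")
mid = ChainNode("backup-1", successor=tail)
head = ChainNode("primary", successor=mid)
head.write("x", 42)
print(tail.read("x"))  # 42
```

Whale's contribution sits on top of this basic structure: decoupling metadata multicast from the data path and acknowledging clients before the chain fully settles.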
Citations: 0
FIRST: Exploiting the Multi-Dimensional Attributes of Functions for Power-Aware Serverless Computing
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00091
Lu Zhang, C. Li, Xinkai Wang, Weiqi Feng, Zheng Yu, Quan Chen, Jingwen Leng, Minyi Guo, Pu Yang, Shang Yue
Emerging cloud-native development models raise new challenges for managing server performance and power at microsecond scale. Compared with traditional cloud workloads, serverless functions exhibit unprecedented heterogeneity, variability, and dynamicity. Designing cloud-native power management schemes for serverless functions requires significant engineering effort. Current solutions remain sub-optimal since their orchestration process is often one-sided, lacking a systematic view. A key obstacle to truly efficient function deployment is the fundamental, wide abstraction gap between upper-layer request scheduling and low-level hardware execution. In this work, we show that the optimal operating point (OOP) for energy efficiency cannot be attained without synthesizing the multi-dimensional attributes of functions. We present FIRST, a novel mechanism that enables servers to better orchestrate serverless functions. The key feature of FIRST is that it leverages a lightweight Internal Representation and meta-Scheduling (IRS) layer for collecting the maximum potential revenue from the servers. Specifically, FIRST follows a pipeline-style workflow. Its frontend components analyze functions from different angles and expose their key features to the system. Meanwhile, its backend components make informed function assignment decisions to avoid OOP divergence. We further demonstrate how to create extensions based on FIRST to enable versatile cloud-native power management. In total, our design constitutes a flexible management layer that supports power-aware function deployment. We show that FIRST allows 94% of functions to be processed under the OOP, which brings up to 24% energy efficiency improvements.
Citations: 1
Satellite Collision Detection using Spatial Data Structures
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00078
C. Hellwig, Fabian Czappa, Martin Michel, R. Bertrand, F. Wolf
In recent years, the number of artificial objects in Earth orbit has increased rapidly due to lower launch costs and new applications for satellites. More and more governments and private companies are discovering space for their own purposes. Private companies are using space as a new business field, launching thousands of satellites into orbit to offer services like worldwide Internet access. Consequently, the probability of collisions and, thus, the degradation of the orbital environment is rapidly increasing. To avoid devastating collisions at an early stage, efficient algorithms are required to identify satellites approaching each other. Traditional deterministic filter-based conjunction detection algorithms compare each satellite to every other satellite and pass them through a chain of orbital filters. Unfortunately, this leads to a runtime complexity of O(n²). In this paper, we propose two alternative approaches that rely on spatial data structures and thus allow us to exploit modern hardware’s parallelism efficiently. Firstly, we introduce a purely grid-based variant that relies on non-blocking atomic hash maps to identify conjunctions. Secondly, we present a hybrid method that combines this approach with traditional filter chains. Both implementations make it possible to identify conjunctions in a large population of millions of satellites with high precision in a comparatively short time. While the grid-based variant is characterized by lower memory consumption, the hybrid variant is faster if enough memory is available.
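The grid-based idea can be sketched with a spatial hash (a minimal sequential sketch under the stated idea; the paper's variant uses non-blocking atomic hash maps so that many threads can insert and query concurrently):

```python
from collections import defaultdict
from itertools import combinations, product
import math

def find_conjunctions(positions, threshold):
    """Grid-based close-approach search: hash each object into a cubic cell
    whose edge equals the distance threshold, then compare only objects in
    the same or neighboring cells, avoiding the O(n^2) all-pairs check."""
    def cell(p):
        return tuple(int(math.floor(c / threshold)) for c in p)

    grid = defaultdict(list)
    for i, p in enumerate(positions):
        grid[cell(p)].append(i)

    close = set()
    for (cx, cy, cz), _ in grid.items():
        # gather candidates from this cell and its 26 neighbors
        cand = []
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            cand.extend(grid.get((cx + dx, cy + dy, cz + dz), []))
        for i, j in combinations(sorted(set(cand)), 2):
            if math.dist(positions[i], positions[j]) < threshold:
                close.add((i, j))
    return close
```

Any two objects closer than the threshold necessarily fall into the same or adjacent cells, so no conjunction is missed, while distant objects are never compared at all.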
Citations: 0
Porting a Computational Fluid Dynamics Code with AMR to Large-scale GPU Platforms
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00066
J. H. Davis, Justin Shafner, Daniel Nichols, N. Grube, P. Martin, A. Bhatele
Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this paper, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo v2.0, on the Summit system, demonstrating a 6× to 44× speedup over the CPU-only version.
引用次数: 0
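The CRoCCo v2.0 abstract above hinges on block-structured AMR: an error estimator flags cells that need resolution, and flagged cells are then grouped into rectangular patches at a finer level. The sketch below illustrates only the cell-tagging step with a gradient-based estimator; it is a hypothetical simplification and does not use the AMReX or CRoCCo APIs.

```python
import numpy as np

def tag_cells_for_refinement(field, threshold):
    """Flag cells whose normalized gradient magnitude exceeds a threshold.

    This mimics the error-estimator step of block-structured AMR; a real
    framework (e.g. AMReX) would next cluster flagged cells into patches.
    """
    gy, gx = np.gradient(field)  # gradients along axis 0 (y) and axis 1 (x)
    err = np.hypot(gx, gy)
    scale = max(np.max(np.abs(field)), 1e-12)  # avoid division by zero
    return err / scale > threshold

# A sharp feature (a shock-like jump at x = 0.5) should be flagged,
# while smooth regions should not.
x = np.linspace(0.0, 1.0, 64)
field = np.tanh((x[None, :] - 0.5) * 100.0) * np.ones((64, 1))
tags = tag_cells_for_refinement(field, threshold=0.05)
```

With this setup the columns around the jump are tagged in every row, and the far-field columns are left untouched, which is the behavior an AMR regridding step relies on.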
Optimizing Cloud Computing Resource Usage for Hemodynamic Simulation
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00063
William Ladd, Christopher W Jensen, M. Vardhan, Jeff Ames, J. Hammond, E. Draeger, A. Randles
Cloud computing resources are becoming an increasingly attractive option for simulation workflows but require users to assess a wider variety of hardware options and associated costs than required by traditional in-house hardware or fixed allocations at leadership computing facilities. The pay-as-you-go model used by cloud providers gives users the opportunity to make more nuanced cost-benefit decisions at runtime by choosing hardware that best matches a given workload, but creates the risk of suboptimal allocation strategies or inadvertent cost overruns. In this work, we propose the use of an iteratively-refined performance model to optimize cloud simulation campaigns against overall cost, throughput, or maximum time to solution. Hemodynamic simulations represent an excellent use case for these assessments, as the relative costs and dominant terms in the performance model can vary widely with hardware, numerical parameters and physics models. Performance and scaling behavior of hemodynamic simulations on multiple cloud services as well as a traditional compute cluster are collected and evaluated, and an initial performance model is proposed along with a strategy for dynamically refining it with additional experimental data.
Citations: 0
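The pay-as-you-go cost-benefit decision described in the abstract above can be reduced to a small optimization once a performance model supplies per-configuration throughput estimates. The sketch below picks the cheapest configuration that still meets a deadline; the instance names, prices, and rates are illustrative assumptions, not measurements from the paper.

```python
def cheapest_within_deadline(configs, total_steps, deadline_hours):
    """Return (name, cost, hours) for the lowest-cost configuration that
    finishes within the deadline, or None if none qualifies.

    configs maps a name to (price_per_hour, steps_per_hour); in a real
    campaign these rates would be refined iteratively from new runs.
    """
    best = None
    for name, (price, rate) in configs.items():
        hours = total_steps / rate
        if hours > deadline_hours:
            continue  # this configuration misses the deadline
        cost = hours * price
        if best is None or cost < best[1]:
            best = (name, cost, hours)
    return best

# Hypothetical price/throughput measurements for three instance types.
configs = {
    "cpu-small": (0.50, 2_000),   # $/hour, simulation steps/hour
    "cpu-large": (2.00, 10_000),
    "gpu-node":  (6.00, 60_000),
}
choice = cheapest_within_deadline(configs, total_steps=100_000,
                                  deadline_hours=12)
# cpu-small would take 50 h and is excluded; gpu-node wins at $10 total.
```

Note that the per-hour-cheapest option is not the winner here: higher throughput can dominate once total hours are accounted for, which is exactly the nuance the abstract attributes to runtime hardware selection.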
H-Cache: Traffic-Aware Hybrid Rule-Caching in Software-Defined Networks
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00017
Zeyu Luan, Qing Li, Yi Wang, Yong Jiang
Ternary Content Addressable Memory (TCAM) is an essential hardware component in SDN-enabled switches, which supports fast lookup speed and flexible matching patterns. However, TCAM’s limited storage capacity has long been a scalability challenge to enforce fine-grained forwarding policies in SDN. Based on the observation of traffic locality, the rule-caching mechanism employs a combination of TCAM and Random Access Memory (RAM) to maintain the forwarding rules of large and small flows, respectively. However, previous works cannot identify large flows timely and accurately, and suffer from high computational complexity when addressing rule dependencies in TCAM. Worse still, TCAM only caches the forwarding rules of large flows but ignores the latency requirements of small flows. Small flows encounter cache-miss in TCAM and then will be diverted to RAM, where they have to experience slow lookup processes. To jointly optimize the performance of both high-throughput large flows and latency-sensitive small flows, we propose a hybrid rule-caching framework, H-Cache, to scale traffic-aware forwarding policies in SDN. H-Cache identifies large flows through a collaboration of learning-based and threshold-based methods to achieve early detection and high accuracy, and proposes a time-efficient greedy heuristic to address rule dependencies. For small flows, H-Cache establishes default paths in TCAM to speed up their lookup processes, and also reduces their TCAM occupancy through label switching and region partitioning. Experiments with both real-world and synthetic datasets demonstrate that H-Cache increases TCAM utilization by an average of 11% and reduces the average completion time of small flows by almost 70%.
Citations: 0
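The TCAM/RAM split described in the H-Cache abstract above can be illustrated with a toy threshold-based promoter: flows whose packet count crosses a threshold are installed in the fast table, while everything else falls back to the slow path. This is a minimal sketch under assumed semantics; it ignores H-Cache's learning-based detector, rule dependencies, and eviction.

```python
from collections import defaultdict

class HybridRuleCache:
    """Toy model of a TCAM/RAM rule cache with threshold-based
    large-flow detection (capacity eviction omitted for brevity)."""

    def __init__(self, threshold, capacity):
        self.threshold = threshold    # packets before a flow counts as large
        self.capacity = capacity      # fast-table (TCAM stand-in) size limit
        self.counts = defaultdict(int)
        self.fast_table = set()

    def lookup(self, flow_id):
        """Return 'fast' on a fast-table hit, 'slow' on the RAM fallback."""
        self.counts[flow_id] += 1
        if flow_id in self.fast_table:
            return "fast"
        if (self.counts[flow_id] >= self.threshold
                and len(self.fast_table) < self.capacity):
            self.fast_table.add(flow_id)  # promote a detected large flow
        return "slow"

cache = HybridRuleCache(threshold=3, capacity=2)
paths = [cache.lookup("elephant") for _ in range(5)]
# the flow is promoted on its 3rd packet and hits the fast table afterwards
```

The delay before promotion is exactly the timeliness problem the abstract points at: a purely threshold-based detector serves the first packets of every large flow on the slow path, which motivates combining it with an early, learning-based detector.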
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)