2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers

RLP: Power Management Based on a Latency-Aware Roofline Model

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00052

Bo Wang, Anara Kozhokanova, C. Terboven, Matthias S. Müller

The ever-growing power draw of high-performance computing (HPC) clusters and rising energy costs create a pressing need for energy-efficient computing. Consequently, advanced infrastructure orchestration is required to regulate power dissipation efficiently. In this work, we propose a novel approach for managing power consumption at runtime based on the well-known roofline model, which we call Roofline Power (RLP) management. RLP uses rigorously selected but generally available hardware performance events to construct rooflines with minimal overhead. In particular, RLP extends the original roofline model to include a memory access latency metric for the first time. The extension identifies whether execution is bandwidth-, latency-, or compute-bound, and improves modeling accuracy. We evaluated the RLP model on server-grade CPUs and a GPU with real-world HPC workloads in two scenarios: optimization without and with power capping. Without power capping, RLP reduces energy-to-solution by up to 22% compared to system default settings, with negligible performance degradation. Under power capping, it accelerates execution by up to 14.7%. In addition, RLP outperforms other state-of-the-art techniques in generality and effectiveness.
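
The classic roofline bound that the abstract builds on caps attainable performance at the smaller of peak compute throughput and arithmetic intensity times memory bandwidth. The sketch below shows that bound together with a three-way classification. The latency rule (a memory-side phase that uses well below peak bandwidth is treated as latency-bound) is an illustrative assumption, not the formulation used by RLP, and all function names, parameters, and thresholds are hypothetical.

```python
def attainable_gflops(ai, peak_gflops, bw_gbs):
    """Classic roofline: performance is capped by compute or by memory traffic.

    ai is the arithmetic intensity in FLOPs per byte moved.
    """
    return min(peak_gflops, ai * bw_gbs)


def classify(ai, peak_gflops, bw_gbs, achieved_bw_gbs, bw_saturation=0.8):
    """Label a phase as compute-, bandwidth-, or latency-bound.

    Assumption (not from the paper): a phase left of the ridge point that
    achieves less than `bw_saturation` of peak bandwidth is waiting on
    individual memory accesses, so it is labeled latency-bound.
    """
    ridge = peak_gflops / bw_gbs  # intensity at which the two roofs meet
    if ai >= ridge:
        return "compute-bound"
    if achieved_bw_gbs >= bw_saturation * bw_gbs:
        return "bandwidth-bound"
    return "latency-bound"
```

A runtime could use such a label to pick a power setting for the current phase; how RLP maps its classification to concrete settings is not described in the abstract.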

Citations: 0

Message from the IPDPS 2023 General Co-chairs

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00005

Citations: 0

Alioth: A Machine Learning Based Interference-Aware Performance Monitor for Multi-Tenancy Applications in Public Cloud

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00095

Tianyao Shi, Yingxuan Yang, Yunlong Cheng, Xiaofeng Gao, Zhen Fang, Yongqiang Yang

Multi-tenancy in public clouds may lead to co-location interference on shared resources, which can degrade the performance of cloud applications. Cloud providers want to know when such events happen and how serious the degradation is, so that they can perform interference-aware migrations and alleviate the problem. However, virtual machines (VMs) in Infrastructure-as-a-Service public clouds are black boxes to providers, and application-level performance information cannot be acquired from them. This makes performance monitoring highly challenging, as cloud providers can rely only on low-level metrics such as CPU usage and hardware counters. We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications. To feed the data-hungry models, we first develop interference generators and conduct comprehensive co-location experiments on a testbed to build the Alioth dataset, which reflects the complexity and dynamics of real-world scenarios. We then construct Alioth by (1) augmenting features by recovering the low-level metrics under no interference using denoising auto-encoders, (2) devising a transfer learning model based on a domain adaptation neural network so that models generalize to test cases unseen in offline training, and (3) developing a SHAP explainer to automate feature selection and enhance model interpretability. Experiments show that Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when tested on applications unseen in the training stage, outperforming the baseline methods. Alioth is also robust in signaling quality-of-service violations under dynamic conditions. Finally, we demonstrate a possible application of Alioth's interpretability, providing insights that benefit the decision-making of cloud operators. The dataset and code of Alioth have been released on GitHub.
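
The accuracy figures above are mean absolute errors. As a minimal sketch of that metric, assuming that both predicted and measured degradation are expressed in percent (the abstract does not state the exact units), it can be computed as follows. The function name and sample values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and measured degradation.

    Both inputs are sequences of degradation values in percent (assumed units).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predictions for three co-located workloads.
error = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```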

Citations: 0

Proactive SLA-aware Application Placement in the Computing Continuum

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00054

Zahra Najafabadi Samani, Narges Mehran, Dragi Kimovski, R.-C. Prodan

The accelerating growth of modern distributed applications with tight delivery deadlines leads to a paradigm shift towards the multi-tier computing continuum. However, the geographical dispersion, heterogeneity, and availability of the continuum resources may result in failures and quality-of-service degradation, significantly negating its advantages and lowering users' satisfaction. In this paper, we propose a proactive application placement (PROS) method that relies on distributed coordination to prevent quality-of-service violations through service-level agreements on the computing continuum. PROS employs a sigmoid function with adaptive weights for the different parameters to predict the service-level agreement assurance of devices based on their past credentials and current capabilities. We evaluate PROS using two application workloads at different traffic stress levels of up to 90 million services on a real testbed with 600 heterogeneous instances deployed over eight geographical locations. The results show that PROS increases the success rate by 7%–33%, reduces the response time by 16%–38%, and increases the deadline satisfaction rate by 19%–42% compared to two related methods. A comprehensive simulation study with 1000 devices and a workload of up to 670 million services confirms the scalability of the results.
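
The sigmoid-based prediction described above can be illustrated with a small sketch. The feature set, the bias term, and the update rule (one gradient step on the logistic loss) are assumptions chosen for illustration; the abstract does not specify which parameters PROS weights or how it adapts them.

```python
import math


def sla_assurance(features, weights, bias=0.0):
    """Map weighted device features to a score in (0, 1) with a sigmoid."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def adapt_weights(weights, features, met_sla, lr=0.1):
    """Hypothetical stand-in for 'adaptive weights': one logistic-loss step.

    Weights move toward features that were present when the SLA was met
    and away from them when it was violated.
    """
    error = (1.0 if met_sla else 0.0) - sla_assurance(features, weights)
    return [w + lr * error * x for w, x in zip(weights, features)]


# Hypothetical device described by (past success ratio, free capacity).
device = [1.0, 0.5]
weights = adapt_weights([0.0, 0.0], device, met_sla=True)
```

After the update, the same device receives a higher score than before, which is the behavior a placement policy would rely on when ranking candidate devices.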

Citations: 0

UnifyFS: A User-level Shared File System for Unified Access to Distributed Local Storage

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00037

Michael J. Brim, A. Moody, Seung-Hwan Lim, Ross G. Miller, Swen Boehm, Cameron Stanavige, K. Mohror, S. Oral

We introduce UnifyFS, a user-level file system that aggregates node-local storage tiers available on high performance computing (HPC) systems and makes them available to HPC applications under a unified namespace. UnifyFS employs transparent I/O interception, so it does not require changes to application code and is compatible with commonly used HPC I/O libraries. The design of UnifyFS supports the predominant HPC I/O workloads and is optimized for bulk-synchronous I/O patterns. Furthermore, UnifyFS provides customizable file system semantics to flexibly adapt its behavior for diverse I/O workloads and storage devices. In this paper, we discuss the unique design goals and architecture of UnifyFS and evaluate its performance on a leadership-class HPC system. In our experimental results, we demonstrate that UnifyFS exhibits excellent scaling performance for write operations and can improve the performance of application checkpoint operations by as much as 3× versus a tuned configuration.

Citations: 1

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00026

Zhenghang Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, Dazhao Cheng

Recently, Mixture-of-Experts (MoE) has become one of the most popular techniques for scaling pre-trained models to extraordinarily large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite existing system and algorithm optimizations, significant challenges remain in communication efficiency and memory consumption. In this paper, we present the design and implementation of MPipeMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Observing that the MoE training procedure can be divided into multiple independent sub-stages, we design adaptive pipeline parallelism with an online algorithm to configure the granularity of the pipelining. Further, we analyze the memory footprint breakdown of MoE training and identify activations and temporary buffers as the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies that reduce memory requirements by eliminating memory redundancies, and we develop an adaptive selection component that determines the optimal strategy at runtime by considering both hardware capacities and model characteristics. We implement MPipeMoE on top of PyTorch and evaluate it with common MoE models on a physical cluster consisting of 8 NVIDIA DGX A100 servers. Compared with the state-of-the-art approach, MPipeMoE achieves up to 2.8× speedup and reduces memory footprint by up to 47% when training large models.
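
Why pipeline granularity needs tuning can be seen from a textbook pipeline cost model: splitting a batch into more chunks increases overlap between stages but adds per-chunk overhead. The sketch below uses that generic model with a fixed candidate list. It is not MPipeMoE's online algorithm, and the stage times, overhead value, and function names are hypothetical.

```python
def pipeline_time(stage_times, n_chunks, overhead):
    """Makespan of a linear pipeline when a batch is split into equal chunks.

    stage_times: time each stage needs for the whole batch
    overhead:    fixed cost paid per chunk per stage (e.g. launch latency)
    Assumes the stages run on independent resources and can fully overlap.
    """
    per_chunk = [t / n_chunks + overhead for t in stage_times]
    # The first chunk traverses every stage; each later chunk adds one
    # slot of the slowest stage.
    return sum(per_chunk) + (n_chunks - 1) * max(per_chunk)


def best_granularity(stage_times, overhead, candidates=(1, 2, 4, 8, 16)):
    """Pick the chunk count with the smallest modeled makespan."""
    return min(candidates, key=lambda n: pipeline_time(stage_times, n, overhead))


# Hypothetical stage costs: dispatch, expert computation, combine.
stages = [4.0, 8.0, 4.0]
chosen = best_granularity(stages, overhead=0.5)
```

With these numbers the modeled time falls from 17.5 without pipelining to 13.0 at four chunks, then rises again as per-chunk overhead dominates, which is the trade-off an adaptive scheme has to resolve at runtime.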

Citations: 2

IPDPS 2023 Organization

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00008

Citations: 0

Signal Detection for Large MIMO Systems Using Sphere Decoding on FPGAs

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00020

Mohamed W. Hassan, A. Dabah, H. Ltaief, Suhaib A. Fahmy

Wireless communication systems rely on aggressive spatial multiplexing with Multiple-Input Multiple-Output (MIMO) access points to enhance network throughput. A significant computational hurdle for large MIMO systems is signal detection and decoding, whose computational complexity increases exponentially with the number of antennas. Hence, the feasibility of large MIMO systems depends on suitable implementations of signal decoding schemes. This paper presents an FPGA-based Sphere Decoder (SD) architecture that provides high-performance signal decoding for large MIMO systems, supporting up to 16-QAM modulation. The SD algorithm is refactored to map well to the FPGA architecture using a GEMM-based approach that exploits the parallel computational power of FPGAs. We implement FPGA-specific optimization techniques to reduce computational complexity. We show a significant improvement in the time to decode the received signal at a bit error rate (BER) below 10^-2. The design is deployed on a Xilinx Alveo U280 FPGA and shows up to a 9× speedup compared to optimized multi-core CPU execution, meeting real-time requirements. Our proposed design reduces power consumption by a geometric mean of 38.1× compared to the CPU implementation, which is important in real-world deployments. We also evaluate our design against alternative approaches on GPUs.
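
For readers unfamiliar with sphere decoding, the following is a minimal textbook depth-first sketch, not the GEMM-based FPGA design described above. It assumes a real-valued system that has already been reduced to upper-triangular form (for example by a QR decomposition) and uses 4-PAM symbols, the per-dimension alphabet of 16-QAM. All names and the sample matrix are hypothetical.

```python
def sphere_decode(R, y, symbols=(-3, -1, 1, 3)):
    """Find s minimizing ||y - R s||^2 for an upper-triangular matrix R.

    The search assigns symbols from the last layer to the first and prunes
    any branch whose partial distance already reaches the best complete
    candidate found so far (the current sphere radius).
    """
    n = len(y)
    best = {"dist": float("inf"), "s": None}
    s = [0] * n

    def search(level, partial):
        if level < 0:
            # Complete candidate: shrink the sphere to its distance.
            best["dist"], best["s"] = partial, list(s)
            return
        for sym in symbols:
            s[level] = sym
            # Row `level` of R only involves symbols level..n-1.
            residual = y[level] - sum(R[level][j] * s[j] for j in range(level, n))
            dist = partial + residual * residual
            if dist < best["dist"]:
                search(level - 1, dist)

    search(n - 1, 0.0)
    return best["s"], best["dist"]


# Hypothetical 3x3 system with a small amount of noise added to R @ sent.
R = [[2.0, 0.5, -0.3], [0.0, 1.5, 0.4], [0.0, 0.0, 1.0]]
sent = [1, -3, 3]
noise = [0.1, -0.2, 0.05]
y = [sum(R[i][j] * sent[j] for j in range(3)) + noise[i] for i in range(3)]
decoded, distance = sphere_decode(R, y)
```

Exhaustive detection would test every one of the 4^n symbol vectors; pruning lets the search skip most of them once a good candidate is found, although the worst case remains exponential.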

Citations: 0

Generalizable Reinforcement Learning-Based Coarsening Model for Resource Allocation over Large and Diverse Stream Processing Graphs

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00051

Lanshun Nie, Yuqi Qiu, Fei Meng, Mo Yu, Jing Li

Resource allocation for stream processing graphs on computing devices is critical to the performance of stream processing. Efficient allocations need to balance workload distribution and minimize communication simultaneously and globally. Since this problem is known to be NP-complete, recent machine learning solutions have been based on an encoder-decoder framework that predicts the device assignment of computing nodes sequentially as an approximation. However, for large graphs, these solutions handle long-distance dependencies and global information poorly, resulting in suboptimal predictions. This work proposes a new paradigm to deal with this challenge: it first coarsens the graph and then conducts assignments on the smaller graph with existing graph partitioning methods. Unlike existing graph coarsening work, we leverage theoretical insights into this resource allocation problem, formulate the coarsening of stream graphs as edge-collapsing predictions, and propose an edge-aware coarsening model. Extensive experiments on various datasets show that our framework significantly improves over existing learning-based and heuristic-based baselines, with up to 56% relative improvement on large graphs.
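
The edge-collapsing formulation can be sketched as follows: given a yes/no decision per edge, the endpoints of every collapsed edge are merged into one supernode and the remaining traffic is aggregated between supernodes. Per the abstract, the decisions come from a learned model; in this sketch they are simply passed in. The function name, the data layout, and the use of a union-find structure are assumptions made for illustration.

```python
def coarsen(num_nodes, edges, collapse):
    """Merge the endpoints of every edge flagged for collapse.

    edges:    list of (source, target, traffic) tuples of a directed graph
    collapse: one boolean per edge (stand-in for the model's predictions)
    Returns (supernode id per original node, traffic between supernodes).
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for (u, v, _), merge in zip(edges, collapse):
        if merge:
            parent[find(u)] = find(v)

    # Renumber the surviving roots as consecutive supernode ids.
    roots = sorted({find(x) for x in range(num_nodes)})
    ids = {root: i for i, root in enumerate(roots)}
    mapping = [ids[find(x)] for x in range(num_nodes)]

    coarse = {}
    for u, v, traffic in edges:
        a, b = mapping[u], mapping[v]
        if a != b:  # edges inside a supernode no longer cost communication
            coarse[(a, b)] = coarse.get((a, b), 0) + traffic
    return mapping, coarse


# Hypothetical four-operator stream graph; the two heaviest edges collapse.
edges = [(0, 1, 5), (1, 2, 1), (2, 3, 4), (0, 2, 2)]
mapping, coarse = coarsen(4, edges, [True, False, True, False])
```

The resulting two-node graph can then be handed to an existing graph partitioner, and each original operator inherits the device assigned to its supernode.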

Citations: 0

Smart Redbelly Blockchain: Reducing Congestion for Web3

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00098

Deepal Tennakoon, Yiding Hua, V. Gramoli

Decentralization promises to remedy the drawbacks of the web by executing decentralized applications (DApps) on blockchains. Unfortunately, modern blockchains cannot support realistic web application workloads, mainly due to congestion. We introduce the Smart Redbelly Blockchain (SRBB), a provably correct permissionless blockchain that reduces congestion by (1) avoiding redundant propagation and validation of transactions with Transaction Validation and Propagation Reduction (TVPR) and (2) mitigating the propagation of invalid transactions within blocks by Byzantine nodes with a dedicated Reward-Penalty Mechanism (RPM). Our comparison of SRBB against Algorand, Avalanche, Diem, Ethereum, Quorum, and Solana, using the DIABLO benchmark suite, indicates that SRBB outperforms all these blockchains under real application workloads. Moreover, SRBB is the only blockchain to successfully execute real workloads of NASDAQ and Uber on a DApp without losing transactions. To demonstrate that TVPR and RPM are the causes of the improved performance, we compare SRBB with its naive baseline, which contains neither TVPR nor RPM. Our results show that TVPR increases the throughput by 55× and divides the latency by 3.5, while RPM increases the throughput by 7% under flooding attacks. Finally, TVPR helps reduce transaction losses in the normal scenario, while RPM goes further and mitigates transaction losses under flooding attacks.

去中心化有望通过在区块链上执行去中心化应用程序(DApps)来弥补网络的缺点。不幸的是，由于拥塞，现代区块链无法支持现实的web应用工作负载。我们介绍了智能红腹区块链(SRBB)，这是一种可证明正确的无权限区块链，通过(1)避免交易的冗余传播和交易验证减少(TVPR)和(2)通过专用奖惩机制(RPM)减轻拜占庭节点在块内无效交易的传播来减少拥塞。我们使用DIABLO基准套件将SRBB与Algorand、Avalanche、Diem、Ethereum、Quorum和Solana进行比较，表明SRBB在实际应用工作负载下的性能优于所有这些区块链。此外，SRBB是唯一一个在DApp上成功执行纳斯达克和Uber真实工作负载而不丢失交易的区块链。为了证明TVPR和RPM是提高性能的原因，我们将SRBB与不包含TVPR和RPM的原始基线进行了比较。我们的研究结果表明，在洪水攻击下，TVPR将吞吐量提高了55倍，将延迟减少了3.5倍，而RPM将吞吐量提高了7%。最后，TVPR有助于减少正常情况下的事务损失，而RPM则更进一步，可以减轻泛洪攻击下的事务损失。


Citations: 4
