IEEE Transactions on Parallel and Distributed Systems最新文献_第6页

DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks DeepCAT+：用于大数据框架的低成本、可转移的在线配置自动调整方法

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-12 DOI: 10.1109/TPDS.2024.3459889

Hui Dou;Yilun Wang;Yiwen Zhang;Pengfei Chen;Zibin Zheng

Big data frameworks usually provide a large number of performance-related parameters. Online auto-tuning these parameters based on deep reinforcement learning (DRL) to achieve a better performance has shown their advantages over search-based and machine learning-based approaches. Unfortunately, the time cost during the online tuning phase of conventional DRL-based methods is still heavy, especially for Big Data applications. Therefore, in this paper, we propose DeepCAT

$^+$

, a low-cost and transferrable deep reinforcement learning-based approach to achieve online configuration auto-tuning for Big Data frameworks. To reduce the total online tuning cost and increase the adaptability: 1) DeepCAT

$^+$

utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT

$^+$

modifies the conventional experience replay to fully utilize the rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT

$^+$

designs a Twin-Q Optimizer to estimate the execution time of each action without the costly configuration evaluation and optimize the sub-optimal ones to achieve a low-cost exploration-exploitation tradeoff; 4) Furthermore, DeepCAT

$^+$

also implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences. Experimental results based on a lab Spark cluster with HiBench benchmark applications show that DeepCAT

$^+$

is able to speed up the best execution time by a factor of 1.49×, 1.63× and 1.65× on average respectively over the baselines, while consuming up to 50.08%, 53.39% and 70.79% less total tuning time. In addition, DeepCAT

$^+$

also has a strong adaptability to the time-varying environment of Big Data frameworks.

大数据框架通常会提供大量与性能相关的参数。与基于搜索和机器学习的方法相比，基于深度强化学习（DRL）的在线自动调整这些参数以获得更好的性能已显示出其优势。遗憾的是，基于 DRL 的传统方法在在线调整阶段的时间成本仍然很高，尤其是在大数据应用中。因此，我们在本文中提出了 DeepCAT$^+$，一种低成本、可移植的基于深度强化学习的方法，用于实现大数据框架的在线配置自动调整。为了降低总的在线调优成本并提高适应性，我们提出了以下几种方法：1）DeepCAT$^+$ 利用 TD3 算法而不是 DDPG 来减轻价值高估；2）DeepCAT$^+$ 通过一种新颖的奖励驱动优先级经验重放机制，修改了传统的经验重放，以充分利用稀有但有价值的转换；3）DeepCAT$^+$ 设计了双 Q 优化器（Twin-Q Optimizer），在不进行高成本配置评估的情况下估算每个动作的执行时间，并优化次优动作，以实现低成本的探索-开发权衡；4）此外，DeepCAT$^+$ 还实现了基于渐进式神经网络的在线持续学习模块，从历史调整经验中转移知识。基于实验室 Spark 集群和 HiBench 基准应用的实验结果表明，DeepCAT$^+$ 能够将最佳执行时间分别比基准平均加快 1.49 倍、1.63 倍和 1.65 倍，同时将总调整时间分别减少 50.08%、53.39% 和 70.79%。此外，DeepCAT$^+$ 还能很好地适应大数据框架的时变环境。

{"title":"DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks","authors":"Hui Dou;Yilun Wang;Yiwen Zhang;Pengfei Chen;Zibin Zheng","doi":"10.1109/TPDS.2024.3459889","DOIUrl":"10.1109/TPDS.2024.3459889","url":null,"abstract":"Big data frameworks usually provide a large number of performance-related parameters. Online auto-tuning these parameters based on deep reinforcement learning (DRL) to achieve a better performance has shown their advantages over search-based and machine learning-based approaches. Unfortunately, the time cost during the online tuning phase of conventional DRL-based methods is still heavy, especially for Big Data applications. Therefore, in this paper, we propose DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000, a low-cost and transferrable deep reinforcement learning-based approach to achieve online configuration auto-tuning for Big Data frameworks. To reduce the total online tuning cost and increase the adaptability: 1) DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 modifies the conventional experience replay to fully utilize the rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 designs a Twin-Q Optimizer to estimate the execution time of each action without the costly configuration evaluation and optimize the sub-optimal ones to achieve a low-cost exploration-exploitation tradeoff; 4) Furthermore, DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 also implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences. Experimental results based on a lab Spark cluster with HiBench benchmark applications show that DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 is able to speed up the best execution time by a factor of 1.49×, 1.63× and 1.65× on average respectively over the baselines, while consuming up to 50.08%, 53.39% and 70.79% less total tuning time. In addition, DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 also has a strong adaptability to the time-varying environment of Big Data frameworks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2114-2131"},"PeriodicalIF":5.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming 卡魔拉基于学习的缓冲区感知预加载，实现自适应短视频流

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-09 DOI: 10.1109/TPDS.2024.3456567

Biao Hou;Song Yang;Fan Li;Liehuang Zhu;Lei Jiao;Xu Chen;Xiaoming Fu

Nowadays, the emerging short video streaming applications have gained substantial attention. With the rapidly burgeoning demand for short video streaming services, maximizing their Quality of Experience (QoE) is an onerous challenge. Current video preloading algorithms cannot determine video preloading sequence decisions appropriately due to the impact of users’ swipes and bandwidth fluctuations. As a result, it is still ambiguous how to improve the overall QoE while mitigating bandwidth wastage to optimize short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system to provide a high QoE of users. In Gamora, we first propose an unordered preloading algorithm that utilizes a Deep Reinforcement Learning (DRL) algorithm to make video preloading decisions. Then, we further devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloading algorithm, which enables the agent to learn from expert demonstrations for fast convergence. Finally, we implement our proposed short video streaming system prototype and evaluate the performance of Gamora on various real-world network datasets. Our results demonstrate that Gamora significantly achieves QoE improvement by 28.7%–51.4% compared to state-of-the-art algorithms, while mitigating bandwidth wastage by 40.7%–83.2% without sacrificing video quality.

如今，新兴的短视频流应用已受到广泛关注。随着短视频流服务需求的快速增长，如何最大限度地提高其体验质量（QoE）成为一项艰巨的挑战。由于用户刷卡和带宽波动的影响，当前的视频预加载算法无法正确确定视频预加载顺序决策。因此，如何在改善整体 QoE 的同时减少带宽浪费以优化短视频流媒体服务仍是一个模糊的问题。在本文中，我们设计了一种缓冲感知短视频流系统 Gamora，为用户提供高 QoE。在 Gamora 中，我们首先提出了一种无序预加载算法，利用深度强化学习（DRL）算法做出视频预加载决策。然后，我们进一步设计了一种非对称模仿学习（AIL）算法来指导基于 DRL 的预加载算法，使代理能够从专家示范中学习，从而快速收敛。最后，我们实现了所提出的短视频流系统原型，并在各种实际网络数据集上评估了 Gamora 的性能。结果表明，与最先进的算法相比，Gamora 的 QoE 显著提高了 28.7%-51.4%，同时在不牺牲视频质量的情况下减少了 40.7%-83.2% 的带宽浪费。

{"title":"Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming","authors":"Biao Hou;Song Yang;Fan Li;Liehuang Zhu;Lei Jiao;Xu Chen;Xiaoming Fu","doi":"10.1109/TPDS.2024.3456567","DOIUrl":"10.1109/TPDS.2024.3456567","url":null,"abstract":"Nowadays, the emerging short video streaming applications have gained substantial attention. With the rapidly burgeoning demand for short video streaming services, maximizing their Quality of Experience (QoE) is an onerous challenge. Current video preloading algorithms cannot determine video preloading sequence decisions appropriately due to the impact of users’ swipes and bandwidth fluctuations. As a result, it is still ambiguous how to improve the overall QoE while mitigating bandwidth wastage to optimize short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system to provide a high QoE of users. In Gamora, we first propose an unordered preloading algorithm that utilizes a Deep Reinforcement Learning (DRL) algorithm to make video preloading decisions. Then, we further devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloading algorithm, which enables the agent to learn from expert demonstrations for fast convergence. Finally, we implement our proposed short video streaming system prototype and evaluate the performance of Gamora on various real-world network datasets. Our results demonstrate that Gamora significantly achieves QoE improvement by 28.7%–51.4% compared to state-of-the-art algorithms, while mitigating bandwidth wastage by 40.7%–83.2% without sacrificing video quality.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2132-2146"},"PeriodicalIF":5.6,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Trusted Model Aggregation With Zero-Knowledge Proofs in Federated Learning 联盟学习中的零知识证明可信模型聚合

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-06 DOI: 10.1109/TPDS.2024.3455762

Renwen Ma;Kai Hwang;Mo Li;Yiming Miao

This paper proposes a new global model aggregation method based on using zero-knowledge federated learning (ZKFL). The purpose is to secure horizontal or P2P federated machine learning systems with shorter aggregation times, higher model accuracy, and lower system costs. We use a model parameter-sharing Chord overlay network among all client hosts. The overlay guarantees a trusted sharing of zero-knowledge proofs for aggregation integrity, even under malicious Byzantine attacks. We tested over popular datasets, Fashion-MNIST and CIFAR10, to prove the new system protection concept. Our benchmark experiments validate the claimed advantages of the ZKFL scheme in all objective functions. Our aggregation method can be applied to secure both rank-based and similarity-based aggregation schemes. For a large system with over 200 clients, our system takes only 3 seconds to yield high-precision global machine models under the ALIE attacks with the Fashion-MNIST dataset. We have achieved up to 85% model accuracy, compared to only 3%

$sim$

45% accuracy observed with federated schemes without protection. Moreover, our method demands a low memory overhead for handling zero-knowledge proofs as the system scales greatly to a larger number of client nodes.

本文基于零知识联合学习（ZKFL）提出了一种新的全局模型聚合方法。其目的是以更短的聚合时间、更高的模型准确性和更低的系统成本确保水平或 P2P 联合机器学习系统的安全。我们在所有客户主机之间使用模型参数共享的 Chord 重叠网络。即使在受到恶意拜占庭攻击的情况下，叠加网络也能保证可信的零知识证明共享，从而保证聚合的完整性。我们在流行数据集 Fashion-MNIST 和 CIFAR10 上进行了测试，以证明新的系统保护概念。我们的基准实验验证了 ZKFL 方案在所有目标函数中宣称的优势。我们的聚合方法既可用于保护基于等级的聚合方案，也可用于保护基于相似性的聚合方案。对于一个拥有 200 多个客户端的大型系统，我们的系统只需 3 秒钟就能利用 Fashion-MNIST 数据集生成 ALIE 攻击下的高精度全局机器模型。我们的模型准确率高达 85%，而无保护的联合方案准确率仅为 3%。此外，我们的方法只需较低的内存开销来处理零知识证明，因为系统可以极大地扩展到更多的客户端节点。

{"title":"Trusted Model Aggregation With Zero-Knowledge Proofs in Federated Learning","authors":"Renwen Ma;Kai Hwang;Mo Li;Yiming Miao","doi":"10.1109/TPDS.2024.3455762","DOIUrl":"10.1109/TPDS.2024.3455762","url":null,"abstract":"This paper proposes a new global model aggregation method based on using zero-knowledge federated learning (ZKFL). The purpose is to secure horizontal or P2P federated machine learning systems with shorter aggregation times, higher model accuracy, and lower system costs. We use a model parameter-sharing Chord overlay network among all client hosts. The overlay guarantees a trusted sharing of zero-knowledge proofs for aggregation integrity, even under malicious Byzantine attacks. We tested over popular datasets, Fashion-MNIST and CIFAR10, to prove the new system protection concept. Our benchmark experiments validate the claimed advantages of the ZKFL scheme in all objective functions. Our aggregation method can be applied to secure both rank-based and similarity-based aggregation schemes. For a large system with over 200 clients, our system takes only 3 seconds to yield high-precision global machine models under the ALIE attacks with the Fashion-MNIST dataset. We have achieved up to 85% model accuracy, compared to only 3%\u0000<inline-formula><tex-math>$sim$</tex-math></inline-formula>\u000045% accuracy observed with federated schemes without protection. Moreover, our method demands a low memory overhead for handling zero-knowledge proofs as the system scales greatly to a larger number of client nodes.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2284-2296"},"PeriodicalIF":5.6,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FedVeca: Federated Vectorized Averaging on Non-IID Data With Adaptive Bi-Directional Global Objective FedVeca：非 IID 数据的联合矢量化平均与自适应双向全局目标

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-04 DOI: 10.1109/TPDS.2024.3454203

Ping Luo;Jieren Cheng;N. Xiong;Zhenhao Liu;Jie Wu

Federated Learning (FL) is a distributed machine learning framework in parallel and distributed systems. However, the systems’ Non-Independent and Identically Distributed (Non-IID) data negatively affect the communication efficiency, since clients with different datasets may cause significant gaps to the local gradients in each communication round. In this article, we propose a Federated Vectorized Averaging (FedVeca) method to optimize the FL communication system on Non-IID data. Specifically, we set a novel objective for the global model which is related to the local gradients. The local gradient is defined as a bi-directional vector with step size and direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, thus we average the bi-directional vectors to reduce the effect of different step sizes. Then, we theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on the upper bounds, we design an algorithm for the server and the client to adaptively adjusts the step sizes that make the objective close to the optimum. Finally, we conduct experiments on different datasets, models and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.

联合学习（FL）是并行和分布式系统中的一种分布式机器学习框架。然而，系统中的非独立和相同分布（Non-IID）数据会对通信效率产生负面影响，因为拥有不同数据集的客户端可能会在每轮通信中对本地梯度造成巨大差距。在本文中，我们提出了一种联邦矢量化平均（FedVeca）方法，用于优化非独立同分布数据的 FL 通信系统。具体来说，我们为全局模型设定了一个与局部梯度相关的新目标。根据我们的定义，局部梯度被定义为具有步长和方向的双向向量，其中步长是局部更新的次数，方向分为正向和负向。在 FedVeca 中，方向受步长的影响，因此我们将双向向量平均化，以减少不同步长的影响。然后，我们从理论上分析了步长与全局目标之间的关系，并得出了每轮通信的步长上限。在此基础上，我们为服务器和客户端设计了一种算法，用于自适应地调整步长，使目标接近最优。最后，我们通过构建原型系统，在不同的数据集、模型和场景下进行了实验，实验结果证明了 FedVeca 方法的有效性和高效性。

{"title":"FedVeca: Federated Vectorized Averaging on Non-IID Data With Adaptive Bi-Directional Global Objective","authors":"Ping Luo;Jieren Cheng;N. Xiong;Zhenhao Liu;Jie Wu","doi":"10.1109/TPDS.2024.3454203","DOIUrl":"10.1109/TPDS.2024.3454203","url":null,"abstract":"Federated Learning (FL) is a distributed machine learning framework in parallel and distributed systems. However, the systems’ Non-Independent and Identically Distributed (Non-IID) data negatively affect the communication efficiency, since clients with different datasets may cause significant gaps to the local gradients in each communication round. In this article, we propose a Federated Vectorized Averaging (FedVeca) method to optimize the FL communication system on Non-IID data. Specifically, we set a novel objective for the global model which is related to the local gradients. The local gradient is defined as a bi-directional vector with step size and direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, thus we average the bi-directional vectors to reduce the effect of different step sizes. Then, we theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on the upper bounds, we design an algorithm for the server and the client to adaptively adjusts the step sizes that make the objective close to the optimum. Finally, we conduct experiments on different datasets, models and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2102-2113"},"PeriodicalIF":5.6,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature 锂后量子数字签名的高吞吐量 GPU 实现

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453289

Shiyu Shen;Hao Yang;Wangchen Dai;Hong Zhang;Zhe Liu;Yunlei Zhao

Digital signatures are fundamental building blocks in various protocols to provide integrity and authenticity. The development of the quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography and has been selected as the primary algorithm for standardization by the National Institute of Standards and Technology. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and IO latency, address bank conflicts, and mitigate pipeline stalls. This results in high and balanced compute throughput and memory throughput for each operation. In terms of concurrent task processing, we leverage task-level batching to fully utilize parallelism and implement a memory pool mechanism for rapid memory access. We propose a dynamic task scheduling mechanism to improve multiprocessor occupancy and significantly reduce execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latencies and maximize the computing capabilities of both CPU and GPU. Across all three security levels, our GPU implementation achieves over 160× speedups for signing and over 80× speedups for verification on both commercial and server-grade GPUs. This achieves microsecond-level amortized execution times for each task, offering a high-throughput and quantum-resistant solution suitable for a wide array of applications in real systems.

数字签名是各种协议中提供完整性和真实性的基本构件。量子计算的发展引发了人们对经典签名方案所提供的安全保证的担忧。CRYSTALS-Dilithium 是一种基于晶格密码学的高效后量子数字签名方案，已被美国国家标准与技术研究院选为标准化的主要算法。在这项工作中，我们介绍了 Dilithium 的高吞吐量 GPU 实现。对于单个操作，我们采用了一系列计算和内存优化措施，以克服顺序限制、减少内存使用和 IO 延迟、解决库冲突并缓解流水线停滞。因此，每项操作的计算吞吐量和内存吞吐量都很高，而且很均衡。在并发任务处理方面，我们利用任务级批处理来充分利用并行性，并实施了快速内存访问的内存池机制。我们提出了一种动态任务调度机制，以提高多处理器占用率并显著缩短执行时间。此外，我们还应用异步计算并启动多个流来隐藏数据传输延迟，最大限度地发挥 CPU 和 GPU 的计算能力。在所有三个安全级别中，我们的 GPU 实现在商用和服务器级 GPU 上的签名速度提高了 160 倍以上，验证速度提高了 80 倍以上。这使得每个任务的摊销执行时间达到了微秒级，从而提供了一种适合实际系统中各种应用的高吞吐量和抗量子解决方案。

{"title":"High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature","authors":"Shiyu Shen;Hao Yang;Wangchen Dai;Hong Zhang;Zhe Liu;Yunlei Zhao","doi":"10.1109/TPDS.2024.3453289","DOIUrl":"10.1109/TPDS.2024.3453289","url":null,"abstract":"Digital signatures are fundamental building blocks in various protocols to provide integrity and authenticity. The development of the quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography and has been selected as the primary algorithm for standardization by the National Institute of Standards and Technology. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and IO latency, address bank conflicts, and mitigate pipeline stalls. This results in high and balanced compute throughput and memory throughput for each operation. In terms of concurrent task processing, we leverage task-level batching to fully utilize parallelism and implement a memory pool mechanism for rapid memory access. We propose a dynamic task scheduling mechanism to improve multiprocessor occupancy and significantly reduce execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latencies and maximize the computing capabilities of both CPU and GPU. Across all three security levels, our GPU implementation achieves over 160× speedups for signing and over 80× speedups for verification on both commercial and server-grade GPUs. This achieves microsecond-level amortized execution times for each task, offering a high-throughput and quantum-resistant solution suitable for a wide array of applications in real systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1964-1976"},"PeriodicalIF":5.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SC-CGRA: An Energy-Efficient CGRA Using Stochastic Computing SC-CGRA：使用随机计算的高能效 CGRA

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453310

Di Mou;Bo Wang;Dajiang Liu

Stochastic Computing (SC) offers a promising computing paradigm for low-power and cost-effective applications, with the added advantage of high error tolerance. In parallel, Coarse-Grained Reconfigurable Arrays (CGRA) prove to be a highly promising platform for domain-specific applications due to their combination of energy efficiency and flexibility. Intuitively, introducing SC to CGRA would significantly reinforce the strengths of both paradigms. However, existing SC-based architectures often encounter inherent computation errors, while the stochastic number generators employed in SC result in exponentially growing latency, which is deemed unacceptable in CGRA. In this work, we propose an SC-based CGRA by replacing the exact multiplication in traditional CGRA with an SC-based multiplication. To improve the accuracy of SC and shorten the latency of Stochastic Number Generators (SNG), we introduce the leading zero shifting and comparator truncation, while keeping the length of bitstream fixed. In addition, due to the flexible interconnections among PEs, we propose a quality scaling strategy that combines neighbor PEs to achieve high-accuracy operations without switching costs like power-gating. Compared to the state-of-the-art approximate computing design of CGRA, our proposed CGRA can averagely achieve a 65.3% reduction in output error while having a 21.2% reduction in energy consumption and a noteworthy 28.37% area savings.

随机计算（Schochastic Computing，SC）为低功耗、高成本效益的应用提供了一种前景广阔的计算范式，并具有高容错性的额外优势。与此同时，粗粒度可重构阵列（CGRA）由于兼具能效和灵活性，被证明是一种非常有前途的特定领域应用平台。直观地说，将 SC 引入 CGRA 将大大加强这两种模式的优势。然而，现有的基于 SC 的架构经常会遇到固有的计算错误，而 SC 中采用的随机数字生成器会导致指数级增长的延迟，这在 CGRA 中被认为是不可接受的。在这项工作中，我们提出了一种基于 SC 的 CGRA，用基于 SC 的乘法取代传统 CGRA 中的精确乘法。为了提高 SC 的精度并缩短随机数发生器 (SNG) 的延迟，我们引入了前导零移位和比较器截断，同时保持比特流的长度不变。此外，由于 PE 之间具有灵活的互连，我们提出了一种质量缩放策略，即结合相邻 PE 实现高精度操作，而无需电源门等开关成本。与最先进的近似计算 CGRA 设计相比，我们提出的 CGRA 平均可将输出误差减少 65.3%，同时能耗减少 21.2%，面积节省 28.37%。

{"title":"SC-CGRA: An Energy-Efficient CGRA Using Stochastic Computing","authors":"Di Mou;Bo Wang;Dajiang Liu","doi":"10.1109/TPDS.2024.3453310","DOIUrl":"10.1109/TPDS.2024.3453310","url":null,"abstract":"Stochastic Computing (SC) offers a promising computing paradigm for low-power and cost-effective applications, with the added advantage of high error tolerance. In parallel, Coarse-Grained Reconfigurable Arrays (CGRA) prove to be a highly promising platform for domain-specific applications due to their combination of energy efficiency and flexibility. Intuitively, introducing SC to CGRA would significantly reinforce the strengths of both paradigms. However, existing SC-based architectures often encounter inherent computation errors, while the stochastic number generators employed in SC result in exponentially growing latency, which is deemed unacceptable in CGRA. In this work, we propose an SC-based CGRA by replacing the exact multiplication in traditional CGRA with an SC-based multiplication. To improve the accuracy of SC and shorten the latency of Stochastic Number Generators (SNG), we introduce the leading zero shifting and comparator truncation, while keeping the length of bitstream fixed. In addition, due to the flexible interconnections among PEs, we propose a quality scaling strategy that combines neighbor PEs to achieve high-accuracy operations without switching costs like power-gating. Compared to the state-of-the-art approximate computing design of CGRA, our proposed CGRA can averagely achieve a 65.3% reduction in output error while having a 21.2% reduction in energy consumption and a noteworthy 28.37% area savings.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2023-2038"},"PeriodicalIF":5.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Towards Efficient Graph Processing in Geo-Distributed Data Centers 在地理分布式数据中心实现高效图形处理

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453872

Feng Yao;Qian Tao;Shengyuan Lin;Yanfeng Zhang;Wenyuan Yu;Shufeng Gong;Qiange Wang;Ge Yu;Jingren Zhou

Iterative graph processing is widely used as a significant paradigm for large-scale data analysis. In many global businesses of multinational enterprises, graph-structure data is usually geographically distributed in different regions to support low-latency services. Geo-distributed graph processing suffers from the Wide Area Networks (WANs) with scarce and heterogeneous bandwidth, thus essentially differs from traditional distributed graph processing. In this paper, we propose RAGraph, a Region-Aware framework for geo-distributed graph processing. At the core of RAGraph, we design a region-aware graph processing framework that allows advancing inefficient global updates locally and enables sensible coordination-free message interactions and flexible replaceable communication module. In terms of graph data preprocessing, RAGraph introduces a contribution-driven edge migration algorithm to effectively utilize network resources. RAGraph also contains an adaptive hierarchical message interaction engine to switch interaction modes adaptively based on network heterogeneity and fluctuation, and a discrepancy-aware message filtering strategy to filter important messages. Experimental results show that RAGraph can achieve an average speedup of 9.7× (up to 98×) and an average WAN cost reduction of 78.5

$%$

(up to 97.3

$%$

) compared with state-of-the-art systems.

迭代图处理作为大规模数据分析的重要范例得到广泛应用。在许多跨国企业的全球业务中，图结构数据通常地理分布在不同地区，以支持低延迟服务。地理分布式图处理受广域网（WAN）带宽稀缺和异构的影响，因此与传统的分布式图处理存在本质区别。在本文中，我们提出了用于地理分布式图处理的区域感知框架 RAGraph。作为 RAGraph 的核心，我们设计了一个区域感知图处理框架，允许在本地推进低效的全局更新，实现合理的免协调消息交互和灵活的可替换通信模块。在图数据预处理方面，RAGraph 引入了贡献驱动的边迁移算法，以有效利用网络资源。RAGraph还包含一个自适应分层消息交互引擎，可根据网络异构性和波动性自适应地切换交互模式，还包含一个差异感知消息过滤策略，可过滤重要消息。实验结果表明，与最先进的系统相比，RAGraph 的平均速度提高了 9.7 倍（最高 98 倍），平均广域网成本降低了 78.5%（最高 97.3%）。

{"title":"Towards Efficient Graph Processing in Geo-Distributed Data Centers","authors":"Feng Yao;Qian Tao;Shengyuan Lin;Yanfeng Zhang;Wenyuan Yu;Shufeng Gong;Qiange Wang;Ge Yu;Jingren Zhou","doi":"10.1109/TPDS.2024.3453872","DOIUrl":"10.1109/TPDS.2024.3453872","url":null,"abstract":"Iterative graph processing is widely used as a significant paradigm for large-scale data analysis. In many global businesses of multinational enterprises, graph-structure data is usually geographically distributed in different regions to support low-latency services. Geo-distributed graph processing suffers from the Wide Area Networks (WANs) with scarce and heterogeneous bandwidth, thus essentially differs from traditional distributed graph processing. In this paper, we propose RAGraph, a \u0000Region-Aware framework for geo-distributed graph processing\u0000. At the core of RAGraph, we design a region-aware graph processing framework that allows advancing inefficient global updates locally and enables sensible coordination-free message interactions and flexible replaceable communication module. In terms of graph data preprocessing, RAGraph introduces a contribution-driven edge migration algorithm to effectively utilize network resources. RAGraph also contains an adaptive hierarchical message interaction engine to switch interaction modes adaptively based on network heterogeneity and fluctuation, and a discrepancy-aware message filtering strategy to filter important messages. Experimental results show that RAGraph can achieve an average speedup of 9.7× (up to 98×) and an average WAN cost reduction of 78.5\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 (up to 97.3\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000) compared with state-of-the-art systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2147-2160"},"PeriodicalIF":5.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ComboFunc: Joint Resource Combination and Container Placement for Serverless Function Scaling With Heterogeneous Container ComboFunc：联合资源组合与容器放置，实现无服务器功能与异构容器的扩展

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3454071

Zhaojie Wen;Qiong Chen;Quanfeng Deng;Yipei Niu;Zhen Song;Fangming Liu

Serverless computing provides developers with a maintenance-free approach to resource usage, but it also transfers resource management responsibility to the cloud platform. However, the fine granularity of serverless function resources can lead to performance bottlenecks and resource fragmentation on nodes when creating many function containers. This poses challenges in effectively scaling function resources and optimizing node resource allocation, hindering overall agility. To address these challenges, we have introduced ComboFunc, an innovative resource scaling system for serverless platforms. ComboFunc associates function with heterogeneous containers of varying specifications and optimizes their resource combination and placement. This approach not only selects appropriate nodes for container creation, but also leverages the new feature of Kubernetes In-place Pod Vertical Scaling to enhance resource scaling agility and efficiency. By allowing a single function to correspond to heterogeneous containers with varying resource specifications and providing the ability to modify the resource specifications of existing containers in place, ComboFunc effectively utilizes fragmented resources on nodes. This, in turn, enhances the overall resource utilization of the entire cluster and improves scaling agility. We also model the problem of combining and placing heterogeneous containers as an NP-hard problem and design a heuristic solution based on a greedy algorithm that solves it in polynomial time. We implemented a prototype of ComboFunc on the Kubernetes platform and conducted experiments using real traces on a local cluster. The results demonstrate that, compared to existing strategies, ComboFunc achieves up to 3.01 × faster function resource scaling and reduces resource costs by up to 42.6%.

无服务器计算为开发人员提供了一种免维护的资源使用方法，但同时也将资源管理责任转移给了云平台。然而，当创建许多功能容器时，无服务器功能资源的细粒度可能会导致节点上出现性能瓶颈和资源碎片。这给有效扩展功能资源和优化节点资源分配带来了挑战，阻碍了整体敏捷性。为了应对这些挑战，我们为无服务器平台推出了创新的资源扩展系统 ComboFunc。ComboFunc 将函数与不同规格的异构容器关联起来，并优化它们的资源组合和布局。这种方法不仅能为容器创建选择合适的节点，还能利用 Kubernetes 就地 Pod 垂直扩展的新功能来提高资源扩展的灵活性和效率。ComboFunc 允许一个函数对应具有不同资源规格的异构容器，并提供就地修改现有容器资源规格的功能，从而有效利用了节点上的零散资源。这反过来又提高了整个集群的整体资源利用率，提高了扩展灵活性。我们还将异构容器的组合和放置问题建模为一个 NP 难问题，并设计了一个基于贪婪算法的启发式解决方案，该方案可在多项式时间内解决该问题。我们在 Kubernetes 平台上实现了 ComboFunc 的原型，并使用本地集群上的真实痕迹进行了实验。结果表明，与现有策略相比，ComboFunc 的函数资源扩展速度提高了 3.01 倍，资源成本降低了 42.6%。

{"title":"ComboFunc: Joint Resource Combination and Container Placement for Serverless Function Scaling With Heterogeneous Container","authors":"Zhaojie Wen;Qiong Chen;Quanfeng Deng;Yipei Niu;Zhen Song;Fangming Liu","doi":"10.1109/TPDS.2024.3454071","DOIUrl":"10.1109/TPDS.2024.3454071","url":null,"abstract":"Serverless computing provides developers with a maintenance-free approach to resource usage, but it also transfers resource management responsibility to the cloud platform. However, the fine granularity of serverless function resources can lead to performance bottlenecks and resource fragmentation on nodes when creating many function containers. This poses challenges in effectively scaling function resources and optimizing node resource allocation, hindering overall agility. To address these challenges, we have introduced ComboFunc, an innovative resource scaling system for serverless platforms. ComboFunc associates function with heterogeneous containers of varying specifications and optimizes their resource combination and placement. This approach not only selects appropriate nodes for container creation, but also leverages the new feature of Kubernetes In-place Pod Vertical Scaling to enhance resource scaling agility and efficiency. By allowing a single function to correspond to heterogeneous containers with varying resource specifications and providing the ability to modify the resource specifications of existing containers in place, ComboFunc effectively utilizes fragmented resources on nodes. This, in turn, enhances the overall resource utilization of the entire cluster and improves scaling agility. We also model the problem of combining and placing heterogeneous containers as an NP-hard problem and design a heuristic solution based on a greedy algorithm that solves it in polynomial time. We implemented a prototype of ComboFunc on the Kubernetes platform and conducted experiments using real traces on a local cluster. The results demonstrate that, compared to existing strategies, ComboFunc achieves up to 3.01 × faster function resource scaling and reduces resource costs by up to 42.6%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1989-2005"},"PeriodicalIF":5.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CODE$^{+}$+: Fast and Accurate Inference for Compact Distributed IoT Data Collection CODE+：针对紧凑型分布式物联网数据采集的快速准确推理。

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453607

Huali Lu;Feng Lyu;Ju Ren;Huaqing Wu;Conghao Zhou;Zhongyuan Liu;Yaoxue Zhang;Xuemin Shen

In distributed IoT data systems, full-size data collection is impractical due to the energy constraints and large system scales. Our previous work has investigated the advantages of integrating matrix sampling and inference for compact distributed IoT data collection, to minimize the data collection cost while guaranteeing the data benefits. This paper further advances the technology by boosting fast and accurate inference for those distributed IoT data systems that are sensitive to computation time, training stability, and inference accuracy. Particularly, we propose CODE

$^{+}$

+

, i.e., Compact Distributed IOT Data CollEction Plus, which features a cluster-based sampling module and a Convolutional Neural Network (CNN)-Transformer Autoencoders-based inference module, to reduce cost and guarantee the data benefits. The sampling component employs a cluster-based matrix sampling approach, in which data clustering is first conducted and then a two-step sampling is performed in accordance with the number of clusters and clustering errors. The inference component integrates a CNN-Transformer Autoencoders-based matrix inference model to estimate the full-size spatio-temporal data matrix, which consists of a CNN-Transformer encoder that extracts the underlying features from the sampled data matrix and a lightweight decoder that maps the learned latent features back to the original full-size data matrix. We implement CODE

$^{+}$

+

under three operational large-scale IoT systems and one synthetic Gaussian distribution dataset, and extensive experiments are provided to demonstrate its efficiency and robustness. With a 20% sampling ratio, CODE

$^{+}$

+

achieves an average data reconstruction accuracy of 94% across four datasets, outperforming our previous version of 87% and state-of-the-art baseline of 71%.

在分布式物联网数据系统中，由于能源限制和庞大的系统规模，全尺寸数据收集是不切实际的。我们之前的工作研究了在紧凑型分布式物联网数据收集中集成矩阵采样和推理的优势，从而在保证数据效益的同时最大限度地降低数据收集成本。本文针对对计算时间、训练稳定性和推理准确性敏感的分布式物联网数据系统，通过提高推理的快速性和准确性，进一步推动了该技术的发展。特别是，我们提出了 CODE$^{+}$+，即 Compact Distributed IOT Data CollEction Plus，它具有基于集群的采样模块和基于卷积神经网络（CNN）-变换器自动编码器的推理模块，以降低成本并保证数据效益。采样组件采用基于聚类的矩阵采样方法，首先对数据进行聚类，然后根据聚类数量和聚类误差进行两步采样。推理组件集成了一个基于 CNN-Transformer Autoencoders 的矩阵推理模型来估计全尺寸时空数据矩阵，它由一个 CNN-Transformer 编码器和一个轻量级解码器组成，前者从采样数据矩阵中提取底层特征，后者则将学习到的潜在特征映射回原始全尺寸数据矩阵。我们在三个运行中的大规模物联网系统和一个合成高斯分布数据集下实现了 CODE$^{+}$+，并通过大量实验证明了其效率和鲁棒性。在采样率为 20% 的情况下，CODE$^{+}$+ 在四个数据集上实现了 94% 的平均数据重建准确率，优于我们之前版本的 87% 和最先进基线的 71%。

{"title":"CODE$^{+}$+: Fast and Accurate Inference for Compact Distributed IoT Data Collection","authors":"Huali Lu;Feng Lyu;Ju Ren;Huaqing Wu;Conghao Zhou;Zhongyuan Liu;Yaoxue Zhang;Xuemin Shen","doi":"10.1109/TPDS.2024.3453607","DOIUrl":"10.1109/TPDS.2024.3453607","url":null,"abstract":"In distributed IoT data systems, full-size data collection is impractical due to the energy constraints and large system scales. Our previous work has investigated the advantages of integrating matrix sampling and inference for compact distributed IoT data collection, to minimize the data collection cost while guaranteeing the data benefits. This paper further advances the technology by boosting fast and accurate inference for those distributed IoT data systems that are sensitive to computation time, training stability, and inference accuracy. Particularly, we propose \u0000<italic>CODE<inline-formula><tex-math>$^{+}$</tex-math><alternatives><mml:math><mml:msup><mml:mrow/><mml:mo>+</mml:mo></mml:msup></mml:math><inline-graphic></alternatives></inline-formula>\u0000, i.e., \u0000<underline>C\u0000ompact Distributed I\u0000<underline>O\u0000T \u0000<underline>D\u0000ata Coll\u0000<underline>E\u0000ction Plus, which features a cluster-based sampling module and a Convolutional Neural Network (CNN)-Transformer Autoencoders-based inference module, to reduce cost and guarantee the data benefits. The sampling component employs a cluster-based matrix sampling approach, in which data clustering is first conducted and then a two-step sampling is performed in accordance with the number of clusters and clustering errors. The inference component integrates a CNN-Transformer Autoencoders-based matrix inference model to estimate the full-size spatio-temporal data matrix, which consists of a CNN-Transformer encoder that extracts the underlying features from the sampled data matrix and a lightweight decoder that maps the learned latent features back to the original full-size data matrix. We implement \u0000<italic>CODE<inline-formula><tex-math>$^{+}$</tex-math><alternatives><mml:math><mml:msup><mml:mrow/><mml:mo>+</mml:mo></mml:msup></mml:math><inline-graphic></alternatives></inline-formula>\u0000 under three operational large-scale IoT systems and one synthetic Gaussian distribution dataset, and extensive experiments are provided to demonstrate its efficiency and robustness. With a 20% sampling ratio, \u0000<italic>CODE<inline-formula><tex-math>$^{+}$</tex-math><alternatives><mml:math><mml:msup><mml:mrow/><mml:mo>+</mml:mo></mml:msup></mml:math><inline-graphic></alternatives></inline-formula>\u0000 achieves an average data reconstruction accuracy of 94% across four datasets, outperforming our previous version of 87% and state-of-the-art baseline of 71%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2006-2022"},"PeriodicalIF":5.6,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication 探索分布式并行稀疏矩阵-多矢量乘法的设计空间

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-08-30 DOI: 10.1109/TPDS.2024.3452478

Hua Huang;Edmond Chow

We consider the distributed memory parallel multiplication of a sparse matrix by a dense matrix (SpMM). The dense matrix is often a collection of dense vectors. Standard implementations will multiply the sparse matrix by multiple dense vectors at the same time, to exploit the computational efficiencies therein. But such approaches generally utilize the same sparse matrix partitioning as if multiplying by a single vector. This article explores the design space of parallelizing SpMM and shows that a coarser grain partitioning of the matrix combined with a column-wise partitioning of the block of vectors can often require less communication volume and achieve higher SpMM performance. An algorithm is presented that chooses a process grid geometry for a given number of processes to optimize the performance of parallel SpMM. The algorithm can augment existing graph partitioners by utilizing the additional concurrency available when multiplying by multiple dense vectors to further reduce communication.

我们考虑的是稀疏矩阵与密集矩阵的分布式内存并行乘法（SpMM）。稠密矩阵通常是稠密向量的集合。标准实现方法会同时用稀疏矩阵与多个稠密向量相乘，以利用其中的计算效率。但这种方法通常使用的稀疏矩阵分区与单个向量相乘的方法相同。本文探讨了 SpMM 并行化的设计空间，并表明较粗粒度的矩阵划分与按列划分的向量块相结合，往往能减少通信量，实现更高的 SpMM 性能。本文提出了一种算法，可为给定数量的进程选择进程网格几何形状，以优化并行 SpMM 性能。该算法可以利用多个密集向量相乘时的额外并发性，进一步减少通信量，从而增强现有的图分割器。

引用次数: 0