
IEEE Transactions on Parallel and Distributed Systems: Latest Articles

Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment
IF 5.3; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-17; DOI: 10.1109/tpds.2024.3462092
Yuyang Jin, Runxin Zhong, Saiqin Long, Jidong Zhai
Citations: 0
Freyr$^+$: Harvesting Idle Resources in Serverless Computing Via Deep Reinforcement Learning
IF 5.3; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-17; DOI: 10.1109/tpds.2024.3462294
Hanfei Yu, Hao Wang
Citations: 0
Efficient Cross-Cloud Partial Reduce With CREW
IF 5.3; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-13; DOI: 10.1109/tpds.2024.3460185
Shouxi Luo, Renyi Wang, Ke Li, Huanlai Xing
Citations: 0
An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design
IF 5.3; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-12; DOI: 10.1109/tpds.2024.3459414
Darong Huang, Luis Costero, David Atienza
Citations: 0
DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks
IF 5.6; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-12; DOI: 10.1109/TPDS.2024.3459889
Hui Dou, Yilun Wang, Yiwen Zhang, Pengfei Chen, Zibin Zheng
Big data frameworks usually expose a large number of performance-related parameters. Online auto-tuning of these parameters with deep reinforcement learning (DRL) has shown advantages over search-based and machine learning-based approaches. Unfortunately, the time cost of the online tuning phase in conventional DRL-based methods is still high, especially for Big Data applications. Therefore, in this paper, we propose DeepCAT$^+$, a low-cost and transferrable DRL-based approach to online configuration auto-tuning for Big Data frameworks. To reduce the total online tuning cost and increase adaptability: 1) DeepCAT$^+$ uses the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT$^+$ modifies conventional experience replay to make full use of rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT$^+$ designs a Twin-Q Optimizer that estimates the execution time of each action without a costly configuration evaluation and optimizes the sub-optimal ones, achieving a low-cost exploration-exploitation tradeoff; 4) DeepCAT$^+$ also implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences. Experimental results on a lab Spark cluster with HiBench benchmark applications show that DeepCAT$^+$ speeds up the best execution time by a factor of 1.49×, 1.63× and 1.65× on average over the baselines, respectively, while consuming up to 50.08%, 53.39% and 70.79% less total tuning time. In addition, DeepCAT$^+$ adapts well to the time-varying environment of Big Data frameworks.
Citations: 0
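The DeepCAT+ abstract above mentions a reward-driven prioritized experience replay that favors rare but valuable transitions, without giving its formula. The following is only a minimal sketch of the general idea under an assumed priority rule: each stored transition is sampled with probability proportional to `(reward - lowest_reward + eps) ** alpha`. The class name, the priority rule, and the parameters `alpha` and `eps` are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    state: tuple
    action: tuple
    reward: float
    next_state: tuple


class RewardPrioritizedReplay:
    """Replay buffer that samples high-reward transitions more often.

    Assumed (hypothetical) rule: priority = (reward - lowest + eps) ** alpha.
    """

    def __init__(self, capacity, alpha=1.0, eps=1e-3, seed=0):
        self.capacity = capacity
        self.alpha = alpha
        self.eps = eps
        self.buffer = []
        self.rng = random.Random(seed)

    def add(self, transition):
        # Drop the oldest transition once the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(transition)

    def probabilities(self):
        # Shift rewards so the lowest one still gets a small positive weight.
        lowest = min(t.reward for t in self.buffer)
        weights = [(t.reward - lowest + self.eps) ** self.alpha for t in self.buffer]
        total = sum(weights)
        return [w / total for w in weights]

    def sample(self, batch_size):
        # Sample with replacement, weighted by reward-based priority.
        return self.rng.choices(self.buffer, weights=self.probabilities(), k=batch_size)
```

In a tuning setting the reward would typically be derived from measured execution time (shorter runs giving higher reward), so the few configurations that ran fast are replayed more often than the many that ran slowly. Setting `alpha=0` falls back to uniform sampling.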
Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming
IF 5.6; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-09; DOI: 10.1109/TPDS.2024.3456567
Biao Hou, Song Yang, Fan Li, Liehuang Zhu, Lei Jiao, Xu Chen, Xiaoming Fu
Emerging short video streaming applications have gained substantial attention. With the rapidly growing demand for short video streaming services, maximizing their Quality of Experience (QoE) is a difficult challenge. Current video preloading algorithms cannot make appropriate preloading sequence decisions because of users' swipes and bandwidth fluctuations. As a result, it remains unclear how to improve overall QoE while mitigating bandwidth wastage in short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system that provides users with a high QoE. In Gamora, we first propose an unordered preloading algorithm that uses a Deep Reinforcement Learning (DRL) algorithm to make video preloading decisions. We then devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloading algorithm, which enables the agent to learn from expert demonstrations for fast convergence. Finally, we implement a prototype of the proposed short video streaming system and evaluate Gamora on various real-world network datasets. Our results demonstrate that Gamora improves QoE by 28.7%–51.4% compared to state-of-the-art algorithms, while reducing bandwidth wastage by 40.7%–83.2% without sacrificing video quality.
Citations: 0
Trusted Model Aggregation with Zero-Knowledge Proofs in Federated Learning
IF 5.3; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-06; DOI: 10.1109/tpds.2024.3455762
Renwen Ma, Kai Hwang, Mo Li, and Yiming Miao
Citations: 0
FedVeca: Federated Vectorized Averaging on Non-IID Data With Adaptive Bi-Directional Global Objective
IF 5.6; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-04; DOI: 10.1109/TPDS.2024.3454203
Ping Luo, Jieren Cheng, N. Xiong, Zhenhao Liu, Jie Wu
Federated Learning (FL) is a distributed machine learning framework in parallel and distributed systems. However, Non-Independent and Identically Distributed (Non-IID) data negatively affect communication efficiency, since clients with different datasets may produce significantly different local gradients in each communication round. In this article, we propose a Federated Vectorized Averaging (FedVeca) method to optimize the FL communication system on Non-IID data. Specifically, we set a novel objective for the global model that is related to the local gradients. The local gradient is defined as a bi-directional vector with a step size and a direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, so we average the bi-directional vectors to reduce the effect of different step sizes. We then theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on these upper bounds, we design an algorithm for the server and the clients that adaptively adjusts the step sizes to bring the objective close to the optimum. Finally, we build a prototype system and conduct experiments on different datasets, models and scenarios; the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.
Citations: 0
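The FedVeca abstract above says that local updates depend on the number of local steps and that averaging is used to reduce the effect of differing step counts, but it does not state the update rule. The sketch below is therefore not FedVeca itself. It illustrates the underlying concern with a simpler stand-in: dividing each client's accumulated update by its local step count before averaging, so that a client that ran more local steps does not dominate the global direction. Function names and the normalization rule are assumptions made for illustration.

```python
def plain_average(updates):
    """Element-wise mean of the raw accumulated client updates."""
    dim = len(updates[0])
    return [sum(u[i] for u in updates) / len(updates) for i in range(dim)]


def normalized_average(updates, local_steps):
    """Average client updates after dividing each by its local step count.

    updates: list of equal-length lists, each the sum of one client's local steps.
    local_steps: list of positive ints, the number of local updates per client.
    This per-step normalization is an illustrative stand-in, not FedVeca's rule.
    """
    if not updates or len(updates) != len(local_steps):
        raise ValueError("need one step count per client update")
    dim = len(updates[0])
    averaged = [0.0] * dim
    for update, steps in zip(updates, local_steps):
        if steps <= 0:
            raise ValueError("local step counts must be positive")
        for i in range(dim):
            averaged[i] += update[i] / steps
    return [value / len(updates) for value in averaged]


# Client A ran 10 local steps along the first axis, client B ran 2 along the second.
updates = [[10.0, 0.0], [0.0, 2.0]]
print(plain_average(updates))                # skewed toward client A
print(normalized_average(updates, [10, 2]))  # both clients weighted equally per step
```

With plain averaging the client that ran five times as many steps pulls the result five times as hard along its own axis; after per-step normalization both directions contribute equally.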
High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature
IF 5.6; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-03; DOI: 10.1109/TPDS.2024.3453289
Shiyu Shen, Hao Yang, Wangchen Dai, Hong Zhang, Zhe Liu, Yunlei Zhao
Digital signatures are fundamental building blocks that provide integrity and authenticity in various protocols. The development of quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography and has been selected as the primary algorithm for standardization by the National Institute of Standards and Technology. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and IO latency, address bank conflicts, and mitigate pipeline stalls. This results in high and balanced compute and memory throughput for each operation. For concurrent task processing, we leverage task-level batching to fully exploit parallelism and implement a memory pool mechanism for rapid memory access. We propose a dynamic task scheduling mechanism that improves multiprocessor occupancy and significantly reduces execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latencies and maximize the computing capabilities of both CPU and GPU. Across all three security levels, our GPU implementation achieves over 160× speedups for signing and over 80× speedups for verification on both commercial and server-grade GPUs. This yields microsecond-level amortized execution times per task, offering a high-throughput, quantum-resistant solution suitable for a wide array of applications in real systems.
Citations: 0
ComboFunc: Joint Resource Combination and Container Placement for Serverless Function Scaling With Heterogeneous Container
IF 5.6; Zone 2 (Computer Science); Q1 COMPUTER SCIENCE, THEORY & METHODS; Pub Date: 2024-09-03; DOI: 10.1109/TPDS.2024.3454071
Zhaojie Wen, Qiong Chen, Quanfeng Deng, Yipei Niu, Zhen Song, Fangming Liu
Serverless computing provides developers with a maintenance-free approach to resource usage, but it also transfers resource management responsibility to the cloud platform. However, the fine granularity of serverless function resources can lead to performance bottlenecks and resource fragmentation on nodes when many function containers are created. This makes it hard to scale function resources effectively and to optimize node resource allocation, hindering overall agility. To address these challenges, we introduce ComboFunc, a resource scaling system for serverless platforms. ComboFunc associates a function with heterogeneous containers of varying specifications and optimizes their resource combination and placement. This approach not only selects appropriate nodes for container creation, but also leverages the new Kubernetes feature In-place Pod Vertical Scaling to enhance resource scaling agility and efficiency. By allowing a single function to correspond to heterogeneous containers with varying resource specifications, and by modifying the resource specifications of existing containers in place, ComboFunc makes effective use of fragmented resources on nodes. This, in turn, raises the resource utilization of the entire cluster and improves scaling agility. We also show that combining and placing heterogeneous containers is an NP-hard problem and design a greedy heuristic that produces a solution in polynomial time. We implemented a prototype of ComboFunc on the Kubernetes platform and conducted experiments using real traces on a local cluster. The results demonstrate that, compared to existing strategies, ComboFunc achieves up to 3.01× faster function resource scaling and reduces resource costs by up to 42.6%.
Citations: 0
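The ComboFunc abstract above describes covering a function's resource demand with heterogeneous containers placed on fragmented nodes, solved by a greedy heuristic whose details are not given. The sketch below shows one simple greedy rule under strong assumptions: a single resource dimension (CPU millicores), a per-container size cap, and a "largest free fragment first" policy. The function name, parameters, and policy are hypothetical and should not be read as ComboFunc's actual algorithm.

```python
def greedy_place(demand, free_by_node, max_size=2000):
    """Cover `demand` millicores with containers carved from node fragments.

    demand: total CPU millicores the function needs (positive int).
    free_by_node: dict mapping node name to free millicores on that node.
    max_size: assumed upper bound on a single container's size.
    Returns a list of (node, container_size) pairs.
    """
    if demand <= 0 or max_size <= 0:
        raise ValueError("demand and max_size must be positive")
    if sum(free_by_node.values()) < demand:
        raise ValueError("cluster does not have enough free capacity")

    free = dict(free_by_node)  # work on a copy; the caller's dict is untouched
    placements = []
    remaining = demand
    while remaining > 0:
        # Greedy choice: the node with the largest free fragment right now.
        node = max(free, key=free.get)
        size = min(free[node], remaining, max_size)
        placements.append((node, size))
        free[node] -= size
        remaining -= size
    return placements


# Containers of different sizes end up on different nodes.
print(greedy_place(3000, {"a": 1500, "b": 2500, "c": 500}))
```

Each iteration scans the nodes once and consumes either a whole fragment, a full `max_size` container, or the rest of the demand, so the loop finishes in a polynomial number of steps. A real placement would also have to consider memory, cold-start cost, and resizing existing containers in place, none of which this sketch models.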