
Latest publications in IEEE Transactions on Parallel and Distributed Systems

HiHGNN: Accelerating HGNNs Through Parallelism and Data Reusability Exploitation
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-30. DOI: 10.1109/TPDS.2024.3394841
Runzhen Xue;Dengke Han;Mingyu Yan;Mo Zou;Xiaocheng Yang;Duo Wang;Wenming Li;Zhimin Tang;John Kim;Xiaochun Ye;Dongrui Fan
Heterogeneous graph neural networks (HGNNs) have emerged as powerful algorithms for processing heterogeneous graphs (HetGs), widely used in many critical fields. To capture both structural and semantic information in HetGs, HGNNs first aggregate the neighboring feature vectors for each vertex in each semantic graph and then fuse the aggregated results across all semantic graphs for each vertex. Unfortunately, existing graph neural network accelerators are ill-suited to accelerating HGNNs, because they fail to efficiently tackle the specific execution patterns and to exploit the high-degree parallelism and data reusability inside and across the processing of semantic graphs in HGNNs. In this work, we first quantitatively characterize a set of representative HGNN models on GPU to disclose the execution bound of each stage, the inter-semantic-graph parallelism, and the inter-semantic-graph data reusability in HGNNs. Guided by our findings, we propose a high-performance HGNN accelerator, HiHGNN, to alleviate the execution bound and exploit the newfound parallelism and data reusability in HGNNs. Specifically, we first propose a bound-aware stage-fusion methodology tailored to HGNN acceleration, which fuses and pipelines the execution stages with awareness of their execution bounds. Second, we design an independency-aware parallel execution scheme to exploit the inter-semantic-graph parallelism. Finally, we present a similarity-aware execution scheduling to exploit the inter-semantic-graph data reusability. Compared to the state-of-the-art software framework running on the NVIDIA T4 and A100 GPUs, HiHGNN achieves an average speedup of 40.0× and 8.3×, respectively, and an energy reduction of 99.59% and 99.74%, with one-fifth the memory bandwidth of the A100 GPU.
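The two-stage computation described above (per-semantic-graph neighbor aggregation, then cross-semantic-graph fusion) can be illustrated with a minimal NumPy sketch; the toy adjacency lists, the mean aggregator, and the averaging fusion are illustrative assumptions rather than the paper's exact operators:

```python
import numpy as np

# Toy heterogeneous graph: 4 vertices, 8-dim features, two assumed semantic graphs
# (e.g., metapath-induced graphs such as author-paper-author and author-venue-author).
num_v, dim = 4, 8
feats = np.random.rand(num_v, dim)
semantic_graphs = {
    "APA": {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]},
    "AVA": {0: [3], 1: [2, 3], 2: [1], 3: [0, 1]},
}

def aggregate(adj, feats):
    """Stage 1: aggregate neighbor features inside one semantic graph (mean here)."""
    out = np.zeros_like(feats)
    for v, nbrs in adj.items():
        out[v] = feats[nbrs].mean(axis=0) if nbrs else feats[v]
    return out

# Stage 1 is independent per semantic graph -- the inter-semantic-graph parallelism
# that an accelerator can exploit across its processing units.
per_graph = {name: aggregate(adj, feats) for name, adj in semantic_graphs.items()}

# Stage 2: fuse each vertex's aggregated results across all semantic graphs
# (plain averaging here; real HGNNs typically use learned semantic attention).
fused = np.mean(np.stack(list(per_graph.values())), axis=0)
print(fused.shape)  # (4, 8)
```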
Citations: 0
TeGraph+: Scalable Temporal Graph Processing Enabling Flexible Edge Modifications
IF 5.6, CAS Tier 2 (Computer Science), Q1 Computer Science, Theory & Methods. Pub Date: 2024-04-26. DOI: 10.1109/TPDS.2024.3393914
Chengying Huan;Yongchao Liu;Heng Zhang;Hang Liu;Shiyang Chen;Shuaiwen Leon Song;Yanjun Wu
Temporal graphs are widely used for time-critical applications, which enable the extraction of graph structural information with temporal features but cannot be efficiently supported by static graph computing systems. However, the current state-of-the-art solutions for temporal graph problems are not only ad hoc and suboptimal, but they also exhibit poor scalability, particularly in terms of their inability to scale to evolving graphs with flexible edge modifications (including insertions and deletions) and diverse execution environments. In this article, we present two key observations. First, temporal path problems can be characterized as topological-optimum problems, which can be efficiently resolved using a universal single-scan execution model. Second, data redundancy in transformed temporal graphs can be mitigated by merging superfluous vertices. Building upon these fundamental insights, we propose TeGraph+, a versatile temporal graph computing engine that makes the following contributions: (1) a unified optimization strategy and execution model for temporal graph problems; (2) a novel graph transformation model with a graph redundancy reduction strategy; (3) a spanning tree decomposition (STD) based distributed execution model which uses an efficient transformed graph decomposition strategy to partition the transformed graph into different spanning trees for distributed execution; (4) an efficient mixed imperative and lazy graph update strategy that offers support for evolving graphs with flexible edge modifications; (5) a general system framework with user-friendly APIs and the support of various execution environments, including in-memory, out-of-core, and distributed execution environments. Our extensive evaluation reveals that TeGraph+ can achieve up to 241× speedups over the state-of-the-art counterparts.
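As a concrete instance of the "topological-optimum, single-scan" idea, an earliest-arrival-time query on a temporal graph can be answered with one pass over edges sorted by departure time; the edge list, tuple format, and query below are illustrative assumptions, not TeGraph+'s API:

```python
# Temporal edges as (src, dst, departure_time, arrival_time) tuples (assumed format).
edges = [
    ("a", "b", 1, 2),
    ("b", "c", 3, 4),
    ("a", "c", 2, 7),
    ("c", "d", 5, 6),
]

def earliest_arrival(edges, source):
    """Single scan over time-ordered edges; each edge is examined exactly once."""
    arrival = {source: 0}
    for src, dst, dep, arr in sorted(edges, key=lambda e: e[2]):
        # The edge is usable only if we can reach src before it departs.
        if src in arrival and arrival[src] <= dep:
            arrival[dst] = min(arrival.get(dst, float("inf")), arr)
    return arrival

print(earliest_arrival(edges, "a"))  # {'a': 0, 'b': 2, 'c': 4, 'd': 6}
```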
Citations: 0
SLO-Aware Function Placement for Serverless Workflows With Layer-Wise Memory Sharing
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-22. DOI: 10.1109/TPDS.2024.3391858
Dazhao Cheng;Kai Yan;Xinquan Cai;Yili Gong;Chuang Hu
Function-as-a-Service (FaaS) is a promising cloud computing model known for its scalability and elasticity. In various application domains, FaaS workflows have been widely adopted to manage user requests and complete computational tasks efficiently. Motivated by the fact that function containers collaboratively use the image layer's memory, so that co-placing functions can leverage memory sharing to reduce the cluster memory footprint, this article studies layer-wise memory sharing for serverless functions. We find that overwhelming memory sharing by placing containers in the same cluster machine may lead to performance deterioration and Service Level Objective (SLO) violations due to the increased CPU pressure. We investigate how to maximally reduce the cluster memory footprint via layer-wise memory sharing for serverless workflows while guaranteeing their SLO. First, we study the container memory sharing problem under serverless workflows with a static Directed Acyclic Graph (DAG) structure. We prove it is NP-hard and propose a 2-approximation algorithm, namely MDP. Then we consider workflows with dynamic DAG structures, where the memory sharing problem is also NP-hard. We design a greedy-based algorithm called GSP to address this issue. We implement a carefully designed prototype on the OpenWhisk platform, and our evaluation results demonstrate that both MDP and GSP achieve a balanced and satisfying state, effectively reducing cache memory usage by up to 63% while guaranteeing the serverless workflow SLO.
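A minimal greedy placement sketch conveys the trade-off the article studies: prefer machines that already cache a function's image layers (to maximize sharing), but only among machines whose CPU load stays under a budget that protects the SLO. The scoring rule, CPU budget, and layer names below are assumptions, not the MDP or GSP algorithms themselves:

```python
# Hypothetical cluster state: cached image layers and current CPU load per machine.
machines = [
    {"id": 0, "layers": set(), "cpu": 0.0},
    {"id": 1, "layers": set(), "cpu": 0.0},
]
functions = [
    {"name": "resize", "layers": {"python3.10", "pillow"}, "cpu": 0.4},
    {"name": "thumb",  "layers": {"python3.10", "pillow"}, "cpu": 0.4},
    {"name": "ocr",    "layers": {"python3.10", "tesseract"}, "cpu": 0.5},
]
CPU_CAP = 0.8  # assumed per-machine CPU budget that keeps latency within the SLO

def place(fn):
    # Only machines with CPU headroom are eligible (SLO guard) ...
    feasible = [m for m in machines if m["cpu"] + fn["cpu"] <= CPU_CAP]
    # ... and among them, pick the one sharing the most image layers (memory saving).
    best = max(feasible, key=lambda m: len(m["layers"] & fn["layers"]))
    best["layers"] |= fn["layers"]
    best["cpu"] += fn["cpu"]
    return best["id"]

for fn in functions:
    print(fn["name"], "-> machine", place(fn))
```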
Citations: 0
Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-19. DOI: 10.1109/TPDS.2024.3391254
Guoqing Xiao;Chuanghui Yin;Yuedan Chen;Mingxing Duan;Kenli Li
Many fields of scientific simulation, such as chemistry and condensed matter physics, are increasingly eschewing dense tensor contraction in favor of sparse tensor contraction. In this work, we center on binary sparse tensor contraction (SpTC), which poses the challenges of index matching and accumulation. To address these difficulties, we present GSpTC, an efficient element-wise SpTC framework for CPU-GPU heterogeneous systems. GSpTC first introduces a fine-grained partitioning strategy based on element-wise tensor contraction. By analyzing and selecting appropriate dimension partitioning strategies, we can efficiently utilize the multi-threading parallelism on GPUs and optimize the overall performance of GSpTC. In particular, GSpTC leverages multi-threading parallelism on GPUs for the contraction phase and the merging phase, which greatly accelerates the computation phase in sparse tensor contraction computations. Furthermore, GSpTC employs parallel pipeline technology to hide the data transmission time between the host and the device, further enhancing its performance. As a result, GSpTC achieves an average performance improvement of 267% compared to the previous state-of-the-art framework Sparta.
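Index matching and accumulation in a binary SpTC can be shown with a dictionary-of-coordinates sketch of C[i,k] = Σ_j A[i,j]·B[j,k]; the COO layout and the choice of contracted mode are assumptions for illustration, not GSpTC's GPU data structures:

```python
from collections import defaultdict

# Sparse operands in COO form: coordinate tuple -> value (assumed toy data).
A = {(0, 1): 2.0, (1, 0): 3.0, (1, 2): 1.0}   # nonzeros of A[i, j]
B = {(1, 0): 4.0, (2, 1): 5.0, (0, 1): 6.0}   # nonzeros of B[j, k]

# Index matching: bucket B's nonzeros by the contracted mode j so each nonzero of A
# only meets the B entries that share its j index.
B_by_j = defaultdict(list)
for (j, k), val in B.items():
    B_by_j[j].append((k, val))

# Accumulation: every matching (i, j) x (j, k) pair adds into the output entry C[i, k].
C = defaultdict(float)
for (i, j), a_val in A.items():
    for k, b_val in B_by_j[j]:
        C[(i, k)] += a_val * b_val

print(dict(C))  # {(0, 0): 8.0, (1, 1): 23.0}
```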
Citations: 0
Formal Definitions and Performance Comparison of Consistency Models for Parallel File Systems
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-18. DOI: 10.1109/TPDS.2024.3391058
Chen Wang;Kathryn Mohror;Marc Snir
The semantics of HPC storage systems are defined by the consistency models to which they adhere. Storage consistency models have been less studied than their counterparts in memory systems, with the exception of the POSIX standard and its strict consistency model. The use of POSIX consistency imposes a performance penalty that becomes more significant as the scale of parallel file systems increases and the access time to storage devices, such as node-local solid-state storage devices, decreases. While some efforts have been made to adopt relaxed storage consistency models, these models are often defined informally and ambiguously as by-products of a particular implementation. In this work, we establish a connection between memory consistency models and storage consistency models and revisit the key design choices of storage consistency models from a high-level perspective. Further, we propose a formal and unified framework for defining storage consistency models and a layered implementation that can be used to easily evaluate their relative performance for different I/O workloads. Finally, we conduct a comprehensive performance comparison of two relaxed consistency models on a range of commonly seen parallel I/O workloads, such as checkpoint/restart of scientific applications and random reads of deep learning applications. We demonstrate that for certain I/O scenarios, a weaker consistency model can significantly improve I/O performance. For instance, for the small random reads typically found in deep learning applications, session consistency achieved a 5× improvement in I/O bandwidth over commit consistency, even at small scales.
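A highly simplified visibility model hints at why a relaxed model can be cheaper: a strict POSIX-like model must make every write visible to all readers immediately, whereas a commit-style relaxed model defers visibility until an explicit commit, letting the file system batch or delay synchronization. The class below is a toy abstraction under those assumptions, not the paper's formal definitions:

```python
# Toy visibility model (an assumption-laden simplification of storage consistency).
class SharedFile:
    def __init__(self, strict):
        self.strict, self.visible, self.pending = strict, {}, {}

    def write(self, offset, data):
        # Strict model: visible everywhere at once (forces eager synchronization).
        # Relaxed model: buffered locally until the writer commits.
        (self.visible if self.strict else self.pending)[offset] = data

    def commit(self):
        self.visible.update(self.pending)
        self.pending.clear()

    def read(self, offset):
        # Models what a remote reader is guaranteed to observe.
        return self.visible.get(offset)

relaxed = SharedFile(strict=False)
relaxed.write(0, b"checkpoint-shard")
print(relaxed.read(0))   # None: other processes need not see the write yet
relaxed.commit()
print(relaxed.read(0))   # b'checkpoint-shard'
```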
Citations: 0
Sampling-Based Multi-Job Placement for Heterogeneous Deep Learning Clusters
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-17. DOI: 10.1109/TPDS.2024.3390109
Kaiyang Liu;Jingrong Wang;Zhiming Huang;Jianping Pan
Heterogeneous deep learning clusters commonly host a variety of distributed learning jobs. In such scenarios, the training efficiency of learning models is negatively affected by the slowest worker. To accelerate the training process, multiple learning jobs may compete for limited computational resources, posing significant challenges to multi-job placement among heterogeneous workers. This article presents a heterogeneity-aware scheduler that solves the multi-job placement problem while taking into account job sizing and load balancing, minimizing the average Job Completion Time (JCT) of deep learning jobs. A novel scheme based on proportional training workload assignment, feasible solution categorization, and matching markets is proposed with theoretical guarantees. To further reduce the computational complexity for low-latency decision-making and improve scheduling fairness, we propose to sparsify the feasible solution categories through sampling, which incurs negligible performance loss in JCT. We evaluate the performance of our design with real-world deep neural network benchmarks on heterogeneous computing clusters. Experimental results show that, compared to existing solutions, the proposed sampling-based scheme can achieve 1) results within 2.04% of the optimal JCT with orders-of-magnitude improvements in algorithm running time, and 2) high scheduling fairness among learning jobs.
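The proportional workload assignment idea, giving each heterogeneous worker a slice of the global batch proportional to its throughput so that no single slow worker stretches the iteration, can be shown in a few lines; the device names and throughputs are assumed numbers, not measurements from the article:

```python
# Assumed per-worker training throughputs (samples/second) for one data-parallel job.
speeds = {"A100": 900.0, "V100": 400.0, "T4": 200.0}
global_batch = 1500

# Assign each worker a share proportional to its speed; per-iteration times equalize,
# so the slowest worker no longer dictates the job's training efficiency.
total = sum(speeds.values())
shares = {w: round(global_batch * s / total) for w, s in speeds.items()}
times = {w: shares[w] / s for w, s in speeds.items()}

print(shares)  # {'A100': 900, 'V100': 400, 'T4': 200}
print(times)   # every worker takes ~1.0 s per iteration
```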
Citations: 0
On Off-Chaining Smart Contract Runtime Protection: A Queuing Model Approach
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-16. DOI: 10.1109/TPDS.2024.3389153
Isra M. Ali;Mohamed M. Abdallah
The vulnerability of smart contracts has been demonstrated by an increasing number of multi-million-dollar exploitation incidents in public blockchains. Several works propose applying runtime verification to protect smart contracts post-deployment. However, none discusses the induced on-chain overhead that may preclude its deployment, leaving smart contracts unprotected. A prominent solution to the on-chain overhead is outsourcing the analysis off-chain. In this work, we analytically study the potential efficiency of off-chain smart contract runtime verification. We present a generic queueing network model of the off-chain runtime verification and the block generation process. The queueing model approach allows us to efficiently and flexibly capture the non-deterministic behavior of the blockchain, estimating the number of transactions in the pool and their corresponding waiting times. We analyze the on-chain overhead and evaluate off-chain RV, providing numerical indicators of transaction processing latency and throughput.
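A worked example of the kind of estimate such a queueing model yields: with Poisson transaction arrivals and exponential service (an M/M/1 simplification, with made-up rates rather than the paper's parameters), standard queueing formulas give the expected pool occupancy and waiting time:

```python
# Assumed rates for illustration only.
lam = 12.0   # transactions arriving at the pool per second
mu = 15.0    # transactions the chain plus off-chain verifier can process per second

rho = lam / mu                 # utilization; the queue is stable only if rho < 1
L = rho / (1 - rho)            # mean number of transactions in the system (M/M/1)
W = 1 / (mu - lam)             # mean time a transaction spends in the system

print(f"utilization={rho:.2f}, transactions in pool={L:.1f}, latency={W:.2f} s")
# utilization=0.80, transactions in pool=4.0, latency=0.33 s
```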
Citations: 0
HRCM: A Hierarchical Regularizing Mechanism for Sparse and Imbalanced Communication in Whole Human Brain Simulations
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-12. DOI: 10.1109/TPDS.2024.3387720
Xin Du;Minglong Wang;Zhihui Lu;Qiang Duan;Yuhao Liu;Jianfeng Feng;Huarui Wang
Brain simulation is one of the most important means to understand how information is represented and processed in the brain, and it usually needs to be realized on supercomputers with a large number of interconnected graphics processing units (GPUs). For whole-human-brain simulation, tens of thousands of GPUs are utilized to simulate tens of billions of neurons and tens of trillions of synapses of the living brain to reveal functional connectivity patterns. However, as an instance of the irregular sparse communication problem on a large-scale system, the sparse and imbalanced communication patterns of the human brain make it particularly challenging to design a communication system for supporting large-scale brain simulations. To face this challenge, this paper proposes a hierarchical regularizing communication mechanism, HRCM. HRCM maintains a hierarchical virtual communication topology (HVCT) with a merge-forward algorithm that exploits the sparsity of neuron interactions to regularize inter-process communications in brain simulations. HRCM also provides a neuron-level partition scheme for assigning neurons to simulation processes to balance the communication load while improving resource utilization. In HRCM, neuron partition is formulated as a k-way graph partition problem and solved efficiently by the proposed hybrid multi-constraint greedy (HMCG) algorithm. HRCM has been implemented in human brain simulations at the scale of up to 86 billion neurons running on 10,000 GPUs. Results obtained from extensive simulation experiments verify the effectiveness of HRCM in significantly reducing communication delay, increasing resource usage, and shortening simulation time for large-scale human brain models.
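A toy greedy assignment conveys the flavor of multi-constraint neuron partitioning: place each neuron where most of its synaptic neighbors already live (to keep spike traffic local) while keeping every process under a load bound. The interaction graph, the 20% imbalance bound, and the single greedy pass are assumptions, not the HMCG algorithm itself:

```python
# Assumed neuron interaction graph: neuron id -> neurons it synapses with.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4}, 4: {3, 5}, 5: {4},
}
k = 2                              # number of simulation processes
cap = len(graph) / k * 1.2         # balance constraint: at most 20% above average load

assignment, loads = {}, [0] * k
for n in sorted(graph, key=lambda x: -len(graph[x])):   # heaviest neurons first
    # Among processes with remaining capacity, pick the one holding the most neighbors,
    # which keeps more spike communication inside a single process.
    feasible = [p for p in range(k) if loads[p] + 1 <= cap]
    best = max(feasible,
               key=lambda p: sum(1 for nb in graph[n] if assignment.get(nb) == p))
    assignment[n] = best
    loads[best] += 1

print(assignment, loads)   # e.g., {2: 0, 0: 0, 1: 0, 3: 1, 4: 1, 5: 1} [3, 3]
```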
Citations: 0
FastTuning: Enabling Fast and Efficient Hyper-Parameter Tuning With Partitioning and Parallelism of Search Space
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-10. DOI: 10.1109/TPDS.2024.3386939
Xiaqing Li;Qi Guo;Guangyan Zhang;Siwei Ye;Guanhua He;Yiheng Yao;Rui Zhang;Yifan Hao;Zidong Du;Weimin Zheng
Hyper-parameter tuning (HPT) for deep learning (DL) models is prohibitively expensive. Sequential model-based optimization (SMBO) emerges as the state-of-the-art (SOTA) approach to automatically optimize HPT performance due to its heuristic advantages. Unfortunately, focusing on algorithm optimization rather than a large-scale parallel HPT system, existing SMBO-based approaches still cannot effectively remove their strong sequential nature, posing two performance problems: (1) extremely low tuning speed and (2) sub-optimal model quality. In this paper, we propose FastTuning, a fast, scalable, and generic system aiming at parallelly accelerating SMBO-based HPT for large DL/ML models. The key is to partition the highly complex search space into multiple smaller sub-spaces, each of which is assigned to and optimized by a different tuning worker in parallel. However, determining the right level of resource allocation to strike a balance between quality and cost remains a challenge. To address this, we further propose NIMBLE, a dynamic scheduling strategy that is specially designed for FastTuning, including (1) Dynamic Elimination Algorithm, (2) Sub-space Re-division, and (3) Posterior Information Sharing. Finally, we incorporate 6 SOTAs (i.e., 3 tuning algorithms and 3 parallel tuning tools) into FastTuning. Experimental results, on ResNet18, VGG19, ResNet50, and ResNet152, show that FastTuning can consistently offer much faster tuning speed (up to 80×) with better accuracy (up to 4.7% improvement), thereby enabling the application of automatic HPT to real-life DL models.
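The search-space partitioning idea can be sketched by splitting one hyper-parameter axis into sub-spaces and letting an independent tuning worker search each in parallel; the toy objective, the random-search worker (standing in for SMBO), and the split points are assumptions:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def objective(lr, batch):
    """Assumed toy objective; a real HPT run would train and validate a model."""
    return -(lr - 0.01) ** 2 - (batch - 64) ** 2 / 1e4

# Partition the learning-rate axis into sub-spaces, one per parallel tuning worker.
sub_spaces = [(1e-4, 5e-3), (5e-3, 2e-2), (2e-2, 1e-1)]

def tune(space, trials=200):
    lo, hi = space
    candidates = ((random.uniform(lo, hi), random.choice([16, 32, 64, 128]))
                  for _ in range(trials))
    best = max(candidates, key=lambda c: objective(*c))
    return best, objective(*best)

with ThreadPoolExecutor(max_workers=len(sub_spaces)) as pool:
    results = list(pool.map(tune, sub_spaces))

print(max(results, key=lambda r: r[1]))   # best configuration across all sub-spaces
```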
Citations: 0
MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
IF 5.3, CAS Tier 2 (Computer Science), Q1 Computer Science. Pub Date: 2024-04-08. DOI: 10.1109/TPDS.2024.3385639
Zheng Zhang;Yaqi Xia;Hulin Wang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Dazhao Cheng
In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system and algorithm optimizations, there are significant challenges to be tackled when it comes to the inefficiencies of communication and memory consumption. In this paper, we present the design and implementation of MPMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by the observation that the MoE training procedure can be divided into multiple independent sub-stages, we design a pipeline parallelism method that reduces communication latency by overlapping it with computation operations. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies to reduce memory requirements by eliminating memory redundancies. Finally, to jointly optimize pipeline granularity and memory reuse strategies, we propose a profile-based algorithm and a performance model to determine the configurations of MPMoE at runtime. We implement MPMoE upon PyTorch and evaluate it with common MoE models on two physical clusters, including 64 NVIDIA A100 GPU cards and 16 NVIDIA V100 GPU cards. Compared with the state-of-the-art approach, MPMoE achieves up to 2.3× speedup while reducing more than 30% memory footprint for training large models.
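A back-of-the-envelope timing model shows why overlapping a MoE layer's all-to-all dispatch with expert computation helps: while the compute stream works on one micro-batch, the communication stream already dispatches the next. The per-chunk timings below are placeholder assumptions, not measurements from MPMoE:

```python
# Assumed per-micro-batch costs (ms) and pipeline depth.
dispatch, compute, chunks = 4.0, 3.0, 4

# Without pipelining, every micro-batch pays dispatch + compute back to back.
serial = chunks * (dispatch + compute)

# With pipelining, dispatch of chunk i+1 overlaps computation of chunk i.
comm_free = comp_free = 0.0
for _ in range(chunks):
    d_end = comm_free + dispatch                  # dispatch occupies the comm stream
    comp_free = max(d_end, comp_free) + compute   # compute waits for its own dispatch
    comm_free = d_end                             # comm stream moves to the next chunk
print(f"serial: {serial:.0f} ms, pipelined: {comp_free:.0f} ms")   # 28 ms vs 19 ms
```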
Citations: 0