Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this letter, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication.
{"title":"FedBiF: Communication-Efficient Federated Learning via Bits Freezing","authors":"Shiwei Li;Qunwei Li;Haozhao Wang;Ruixuan Li;Jianbin Lin;Wenliang Zhong","doi":"10.1109/TPDS.2025.3610224","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3610224","url":null,"abstract":"Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this letter, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2668-2678"},"PeriodicalIF":6.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-11 | DOI: 10.1109/TPDS.2025.3608434
Yiqin Dai;Ruibo Wang;Yong Dong;Min Xie;Juan Chen;Wenzhe Zhang;Huijun Wu;Mingtian Shao;Kai Lu
As the size of MPI programs grows with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates due to the inclusion of less scalable global operations. Global operations involving extensive cross-machine communication and synchronization are crucial for ensuring semantic correctness. The current focus is on optimizing and accelerating these global operations rather than removing them, as the latter involves systematic changes to the system software stack and may affect program semantics. Given this background, we propose a systematic solution named MIST to safely eliminate global operations in MPI startup and termination. By optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve MPI instant startup and termination while ensuring correct program execution. Experiments on the Tianhe-2A supercomputer demonstrate that MIST can reduce the MPI_Init() time by 32.5-77.6% and the MPI_Finalize() time by 28.9-85.0%.
{"title":"MIST: Towards MPI Instant Startup and Termination on Tianhe HPC Systems","authors":"Yiqin Dai;Ruibo Wang;Yong Dong;Min Xie;Juan Chen;Wenzhe Zhang;Huijun Wu;Mingtian Shao;Kai Lu","doi":"10.1109/TPDS.2025.3608434","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3608434","url":null,"abstract":"As the size of MPI programs grows with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates due to the inclusion of less scalable global operations. Global operations involving extensive cross-machine communication and synchronization are crucial for ensuring semantic correctness. The current focus is on optimizing and accelerating these global operations rather than removing them, as the latter involves systematic changes to the system software stack and may impact program semantics. Given this background, we propose a systematic solution named MIST to safely eliminate global operations in MPI startup and termination. Through optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve MPI instant startup and termination while ensuring correct program execution. Experiments on Tianhe-2 A supercomputer demonstrate that MIST can reduce the <italic>MPI_Init()</i> time by 32.5-77.6% and the <italic>MPI_Finalize()</i> time by 28.9-85.0%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2341-2353"},"PeriodicalIF":6.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-11 | DOI: 10.1109/TPDS.2025.3609152
Jie Gao;Jia Hu;Geyong Min;Fei Hao
Graph neural networks (GNNs) are a state-of-the-art technique for learning structural information from graph data. However, training GNNs on large-scale graphs is very challenging due to the size of real-world graphs and the message-passing architecture of GNNs. One promising approach for scaling GNNs is distributed training across multiple accelerators, where each accelerator holds a partitioned subgraph that fits in memory and trains the model in parallel. Existing distributed GNN training methods require frequent and prohibitive embedding exchanges between partitions, leading to substantial communication overhead and limiting training efficiency. To address this challenge, we propose XDGNN, a novel distributed GNN training method that eliminates the forward communication bottleneck and thus accelerates training. Specifically, we design an explanation-guided subgraph expansion technique that incorporates important structures identified by eXplainable AI (XAI) methods into local partitions, mitigating the information loss caused by graph partitioning. XDGNN then conducts communication-free distributed training on these self-contained partitions, training the model in parallel without communicating node embeddings in the forward phase. Extensive experiments demonstrate that XDGNN significantly improves training efficiency while maintaining model accuracy compared with current distributed GNN training methods.
{"title":"XDGNN: Efficient Distributed GNN Training via Explanation-Guided Subgraph Expansion","authors":"Jie Gao;Jia Hu;Geyong Min;Fei Hao","doi":"10.1109/TPDS.2025.3609152","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3609152","url":null,"abstract":"Graph neural network (GNN) is a state-of-the-art technique for learning structural information from graph data. However, training GNNs on large-scale graphs is very challenging due to the size of real-world graphs and the message-passing architecture of GNNs. One promising approach for scaling GNNs is distributed training across multiple accelerators, where each accelerator holds a partitioned subgraph that fits in memory to train the model in parallel. Existing distributed GNN training methods require frequent and prohibitive embedding exchanges between partitions, leading to substantial communication overhead and limited the training efficiency. To address this challenge, we propose XDGNN, a novel distributed GNN training method that eliminates the forward communication bottleneck and thus accelerates training. Specifically, we design an explanation-guided subgraph expansion technique that incorporates important structures identified by eXplanation AI (XAI) methods into local partitions, mitigating information loss caused by graph partitioning. Then, XDGNN conducts communication-free distributed training on these self-contained partitions through training the model in parallel without communicating node embeddings in the forward phase. Extensive experiments demonstrate that XDGNN significantly improves training efficiency while maintaining the model accuracy compared with current distributed GNN training methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2354-2365"},"PeriodicalIF":6.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1109/TPDS.2025.3606878
YuAng Chen;Jeffrey Xu Yu
Triangle counting is a fundamental graph algorithm used to identify the number of triangles within a graph. It can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection, and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle the sparse workloads of triangle counting presents non-trivial challenges. In this paper, we conduct an in-depth analysis of the state-of-the-art techniques that utilize Tensor Cores for matrix operations, identifying critical performance shortfalls. Based on these insights, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionality to diverse sparse matrix operations. In experiments, ToT is evaluated against state-of-the-art methods; it outperforms the second-fastest method with a 3.81× speedup in end-to-end execution and achieves up to 17.00× memory savings. This work represents a pioneering exploration of Tensor Cores for accelerating the triangle counting algorithm.
{"title":"ToT: Triangle Counting on Tensor Cores","authors":"YuAng Chen;Jeffrey Xu Yu","doi":"10.1109/TPDS.2025.3606878","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3606878","url":null,"abstract":"Triangle counting is a fundamental graph algorithm used to identify the number of triangles within a graph. This algorithm can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle sparse workloads for triangle counting presents non-trivial challenges. In this paper, we conduct an in-depth analysis of the state-of-the-art techniques that utilizes Tensor Cores for matrix operations, identifying critical performance shortfalls. Based on these insights, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionalities for diverse sparse matrix operations. In experiments, ToT is evaluated against state-of-the-art methods. ToT outperform the second-fastest method with an 3.81× speedup in end-to-end execution. Also, it achieves up to 17.00× memory savings. This work represents a pioneering exploration into utilizing Tensor Cores for accelerating the triangle counting algorithm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2679-2692"},"PeriodicalIF":6.0,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11153046","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-04 | DOI: 10.1109/TPDS.2025.3606001
Bohuai Xiao;Chujia Yu;Xing Chen;Zheyi Chen;Geyong Min
Computation offloading utilizes powerful cloud and edge resources to process workflow applications offloaded from Mobile Devices (MDs), effectively alleviating the resource constraints of MDs. In end-edge-cloud environments, workflow applications typically exhibit complex task dependencies. Meanwhile, parallel tasks from multiple MDs result in an expansive solution space for offloading decisions. Therefore, determining optimal offloading plans for highly dynamic and complex end-edge-cloud environments presents significant challenges. Existing studies on offloading tasks for multi-MD workflows often adopt centralized decision-making methods, which suffer from prolonged decision time, high computational overhead, and an inability to identify suitable offloading plans in large-scale scenarios. To address these challenges, we propose MCWT-AC, a Multi-agent Collaborative method for Workflow Task offloading in end-edge-cloud environments based on the Actor-Critic algorithm. First, each MD is modeled as an agent that independently makes offloading decisions based on local information. Next, each MD's workflow task offloading decision model is obtained through the Actor-Critic algorithm. At runtime, an effective workflow task offloading plan is gradually developed through multi-agent collaboration. Extensive simulation results demonstrate that MCWT-AC exhibits superior adaptability and scalability. Moreover, MCWT-AC outperforms state-of-the-art methods and quickly achieves optimal or near-optimal performance.
{"title":"Multi-Agent Collaboration for Workflow Task Offloading in End-Edge-Cloud Environments Using Deep Reinforcement Learning","authors":"Bohuai Xiao;Chujia Yu;Xing Chen;Zheyi Chen;Geyong Min","doi":"10.1109/TPDS.2025.3606001","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3606001","url":null,"abstract":"Computation offloading utilizes powerful cloud and edge resources to process workflow applications offloaded from Mobile Devices (MDs), effectively alleviating the resource constraints of MDs. In end-edge-cloud environments, workflow applications typically exhibit complex task dependencies. Meanwhile, parallel tasks from multi-MDs result in an expansive solution space for offloading decisions. Therefore, determining optimal offloading plans for highly dynamic and complex end-edge-cloud environments presents significant challenges. The existing studies on offloading tasks for multi-MD workflows often adopt centralized decision-making methods, which suffer from prolonged decision time, high computational overhead, and inability to identify suitable offloading plans in large-scale scenarios. To address these challenges, we propose a Multi-agent Collaborative method for Workflow Task offloading in end-edge-cloud environments with the Actor-Critic algorithm called MCWT-AC. First, each MD is modeled as an agent and independently makes offloading decisions based on local information. Next, each MD’s workflow task offloading decision model is obtained through the Actor-Critic algorithm. At runtime, an effective workflow task offloading plan can be gradually developed through multi-agent collaboration. Extensive simulation results demonstrate that the MCWT-AC exhibits superior adaptability and scalability. Moreover, the MCWT-AC outperforms the state-of-art methods and can quickly achieve optimal/near-optimal performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2281-2296"},"PeriodicalIF":6.0,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605674
Olivier Beaumont;Rémi Bouzel;Lionel Eyraud-Dubois;Esragul Korkmaz;Laércio Lima Pilla;Alexandre Van Kempen
We study two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective is to minimize total energy consumption. First, we consider scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We propose a novel $\frac{5}{4}(1+\epsilon)$-approximation algorithm, $\mathcal{BEKP}$, by associating it to a Multiple Subset Sum problem, improving upon the existing $(\frac{3}{2} - \frac{1}{2m})$-approximation for arbitrary rejection costs. Next, we address scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We position this problem within the literature and introduce a new $(1-\frac{(m-1)^{m}}{m^{m}})$-approximation algorithm, $\mathcal{MDP}$, inspired by an interval selection algorithm with a $(1-\frac{m^{m}}{(m+1)^{m}})$-approximation for arbitrary rejection costs. Experimental results demonstrate that $\mathcal{BEKP}$ and $\mathcal{MDP}$ obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.
{"title":"Approximation Algorithms for Scheduling With/Without Deadline Constraints Where Rejection Costs are Proportional to Processing Times","authors":"Olivier Beaumont;Rémi Bouzel;Lionel Eyraud-Dubois;Esragul Korkmaz;Laércio Lima Pilla;Alexandre Van Kempen","doi":"10.1109/TPDS.2025.3605674","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605674","url":null,"abstract":"We study two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective is to minimize total energy consumption. First, we consider scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We propose a novel <inline-formula><tex-math>$frac{5}{4}(1+epsilon )$</tex-math></inline-formula>-approximation algorithm, <inline-formula><tex-math>$mathcal {BEKP}$</tex-math></inline-formula>, by associating it to a Multiple Subset Sum problem, improving upon the existing <inline-formula><tex-math>$ (frac{3}{2} - frac{1}{2m})$</tex-math></inline-formula>-approximation for arbitrary rejection costs. Next, we address scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We position this problem within the literature and introduce a new <inline-formula><tex-math>$(1-frac{(m-1)^{m}}{m^{m}})$</tex-math></inline-formula>-approximation algorithm, <inline-formula><tex-math>$mathcal {MDP}$</tex-math></inline-formula>, inspired by an interval selection algorithm with a <inline-formula><tex-math>$(1-frac{m^{m}}{(m+1)^{m}})$</tex-math></inline-formula>-approximation for arbitrary rejection costs. Experimental results demonstrate that <inline-formula><tex-math>$mathcal {BEKP}$</tex-math></inline-formula> and <inline-formula><tex-math>$mathcal {MDP}$</tex-math></inline-formula> obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2596-2608"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605491
Zhengyi Yuan;Xiong Wang;Yuntao Nie;Yufei Tao;Yuqing Li;Zhiyuan Shao;Xiaofei Liao;Bo Li;Hai Jin
Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they continue to suffer from inefficiency and susceptibility to volatile hardware environments due to their suboptimal and static configurations. In this article, we propose DynPipe, an interference-aware asynchronous pipeline framework to optimize end-to-end training performance in highly dynamic computing environments. By characterizing the non-overlapped communication overheads and the convergence rate conditioned on stage-wise staleness, DynPipe carefully crafts an optimized pipeline partition that harmonizes hardware speed with statistical convergence. Moreover, DynPipe deploys a non-intrusive random forest model that utilizes runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on training efficiency. Following this evaluation guidance, DynPipe adaptively adjusts the partition plan to restore both intra- and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating time-to-accuracy by 1.5-3.4×.
{"title":"DynPipe: Toward Dynamic End-to-End Pipeline Parallelism for Interference-Aware DNN Training","authors":"Zhengyi Yuan;Xiong Wang;Yuntao Nie;Yufei Tao;Yuqing Li;Zhiyuan Shao;Xiaofei Liao;Bo Li;Hai Jin","doi":"10.1109/TPDS.2025.3605491","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605491","url":null,"abstract":"Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they continue to suffer from <italic>inefficiency</i> and <italic>susceptibility</i> to <italic>volatile</i> hardware environment due to their suboptimal and <italic>static</i> configurations. In this article, we propose DynPipe, an <italic>interference-aware</i> asynchronous pipeline framework to optimize the <italic>end-to-end</i> training performance in highly <italic>dynamic</i> computing environments. By characterizing the <italic>non-overlapped</i> communication overheads and <italic>convergence</i> rate conditioned on stage-wise staleness, DynPipe carefully crafts an optimized pipeline partition that harmonizes the hardware speed with statistical convergence. Moreover, DynPipe deploys a <italic>non-intrusive</i> random forest model that utilizes runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on the training efficiency. Following the evaluation guidance, DynPipe adaptively <italic>adjusts</i> partition plan to restore both intra and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating the time-to-accuracy by <italic>1.5-3.4×</i>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2366-2382"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11150566","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate membership query (AMQ) data structures can approximately determine whether an element exists in a given dataset. They are widely used in parallel and distributed systems (e.g., high-performance databases, distributed cache systems, and bioinformatics systems) to avoid unnecessary dataset accesses, thereby accelerating massive data processing. For AMQ data structures used in these systems, simultaneously achieving high throughput, a low false positive rate, and large capacity is critical but challenging. Porting AMQ data structures from DRAM to persistent memory makes it possible to achieve all three objectives at once, but this porting is not a trivial task. Specifically, existing AMQ data structures generate numerous random accesses and/or sequential writes on persistent memory, resulting in poor throughput. Therefore, in the conference version of this paper, we proposed a novel AMQ data structure called the wormhole filter, which achieves high throughput on persistent memory and thereby attains the three objectives simultaneously. In this journal version, we extend our prior work by introducing parallel wormhole filters to enhance parallel performance. Additionally, we integrate parallel wormhole filters into the LevelDB database system to show that porting AMQ data structures to persistent memory significantly improves end-to-end system throughput. Theoretical analysis and experimental results show that wormhole filters significantly outperform state-of-the-art AMQ data structures. For example, wormhole filters achieve 12.06× the insertion throughput, 1.98× the positive lookup throughput, and 8.82× the deletion throughput of the best competing baseline.
{"title":"Parallel Wormhole Filters: High-Performance Approximate Membership Query Data Structures for Persistent Memory","authors":"Hancheng Wang;Haipeng Dai;Shusen Chen;Meng Li;Rong Gu;Youyou Lu;Chengxun Wu;Jiaqi Zheng;Lexi Xu;Guihai Chen","doi":"10.1109/TPDS.2025.3605780","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605780","url":null,"abstract":"Approximate membership query (AMQ) data structures can approximately determine whether an element exists in a given dataset. They are widely used in parallel and distributed systems (e.g., high-performance databases, distributed cache systems, and bioinformatics systems) to avoid unnecessary dataset accesses, thereby accelerating massive data processing. For AMQ data structures used in the above systems, achieving high throughput, low false positive rate, and large capacity objectives simultaneously is critical but challenging. Porting AMQ data structures from DRAM to persistent memory makes it possible to achieve the above three objectives simultaneously, but this porting is not a trivial task. Specifically, existing AMQ data structures generate numerous random accesses and/or sequential writes on persistent memory, resulting in poor throughput. Therefore, in the conference version of this paper, we proposed a novel AMQ data structure called wormhole filter, which achieves high throughput on persistent memory, thereby achieving the above three objectives simultaneously. In this journal version, we extend our prior work by introducing parallel wormhole filters to enhance parallel performance. Additionally, we integrate parallel wormhole filters into the LevelDB database system to show that porting AMQ data structures to persistent memory significantly improves system end-to-end throughput. Theoretical analysis and experimental results show that wormhole filters significantly outperform state-of-the-art AMQ data structures. For example, wormhole filters achieve 12.06× insertion throughput, 1.98× positive lookup throughput, and 8.82× deletion throughput of the best competing baseline.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2229-2246"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-02 | DOI: 10.1109/TPDS.2025.3605272
Huijun Wang;Oliver Sinnen
Task scheduling for parallel computing is strongly NP-hard even without precedence constraints ($P||C_{\max}$). With any kind of precedence constraints and communication delays, the problem becomes less manageable still. We look at the specific case of scheduling under the precedence constraints of a fork-join structure, including communication delays: $P[Q]|\text{fork-join}, c_{ij}|C_{\max}$. This represents any kind of computation that divides into sub-computations whose end results are processed together. Looking at special cases where computation costs are equal, we propose polynomial-time approximation and exact algorithms for them, considering homogeneous and (related) heterogeneous processors. Having those algorithms allows us to study the quality of heuristics in a large experimental evaluation, which demonstrates that heuristic schedulers perform well enough in most cases.
{"title":"Scheduling Fork-Joins With Communication Delays and Equal Processing Times on Heterogeneous Processors","authors":"Huijun Wang;Oliver Sinnen","doi":"10.1109/TPDS.2025.3605272","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605272","url":null,"abstract":"Task scheduling for parallel computing is strongly NP-hard even without precedence constraints <inline-formula><tex-math>$P||C_{max}$</tex-math></inline-formula>. With any kind of precedence constraints and communication delays the problem becomes less manageable still. We look at the specific case of scheduling under the precedence constraints of a fork-join structure (including communication delays) <inline-formula><tex-math>$P[Q]|fork-join, c_{ij}|C_{max}$</tex-math></inline-formula>. This represents any kind of computation that divides into sub-computations with the end results being processed together. Looking at special cases where computation costs are equal, we propose polynomial time approximations and exact algorithms for them, considering homogenous and (related) heterogenous processors. Having those algorithms allows us to study the quality of heuristics in a large experimental evaluation. This demonstrates that heuristic schedulers perform well enough in most cases.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2297-2309"},"PeriodicalIF":6.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this article, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce synchronization, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms the state-of-the-art GPU implementations by 1.23× in CKKS HMUL and 1.15× in bootstrapping. It also achieves up to 4.87× and 1.51× speedups for TFHE bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3× average speedup for scheme switching over CPU-based implementation.
{"title":"Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs","authors":"Zhiwei Wang;Haoqi He;Lutan Zhao;Peinan Li;Zhihao Li;Dan Meng;Rui Hou","doi":"10.1109/TPDS.2025.3604866","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3604866","url":null,"abstract":"Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this article, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce synchronization, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms the state-of-the-art GPU implementations by 1.23× in CKKS HMUL and 1.15× in bootstrapping. It also achieves up to 4.87× and 1.51× speedups for TFHE bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3× average speedup for scheme switching over CPU-based implementation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2264-2280"},"PeriodicalIF":6.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11146703","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}