IEEE Transactions on Parallel and Distributed Systems: Latest Articles, Page 4

FedVeca: Federated Vectorized Averaging on Non-IID Data With Adaptive Bi-Directional Global Objective

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-04 DOI: 10.1109/TPDS.2024.3454203

Ping Luo;Jieren Cheng;N. Xiong;Zhenhao Liu;Jie Wu

Federated Learning (FL) is a distributed machine learning framework in parallel and distributed systems. However, Non-Independent and Identically Distributed (Non-IID) data in such systems negatively affect communication efficiency, since clients with different datasets may produce significantly different local gradients in each communication round. In this article, we propose a Federated Vectorized Averaging (FedVeca) method to optimize the FL communication system on Non-IID data. Specifically, we set a novel objective for the global model that is related to the local gradients. The local gradient is defined as a bi-directional vector with step size and direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, so we average the bi-directional vectors to reduce the effect of different step sizes. Then, we theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on the upper bounds, we design an algorithm for the server and the clients to adaptively adjust the step sizes so that the objective approaches the optimum. Finally, we conduct experiments on different datasets, models and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.
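The abstract does not give FedVeca's update rule, so the following is only a generic illustration of why unequal local step counts matter when client updates are averaged. It uses step-normalized averaging (in the spirit of FedNova), which is an assumption here and not the FedVeca algorithm; the function name `average_updates` and its arguments are hypothetical.

```python
import numpy as np

def average_updates(deltas, local_steps, normalize):
    """Average client model updates on the server.

    deltas[i]      : client i's accumulated update (local model minus global model)
    local_steps[i] : number of local updates client i performed (its "step size")
    normalize      : if True, average per-step directions and rescale by the mean
                     step count, so clients with many steps do not dominate
    """
    deltas = np.asarray(deltas, dtype=float)
    steps = np.asarray(local_steps, dtype=float)
    if normalize:
        per_step = deltas / steps[:, None]      # direction per local update
        return per_step.mean(axis=0) * steps.mean()
    return deltas.mean(axis=0)                  # plain averaging

# Two clients pulling in opposite directions with very different step counts.
deltas = [[1.0], [-9.0]]
local_steps = [1, 9]
print(average_updates(deltas, local_steps, normalize=False))
print(average_updates(deltas, local_steps, normalize=True))
```

In this toy case plain averaging yields -4, dominated by the client that ran nine local updates, while the normalized average is 0 because the two per-step directions cancel.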

Volume 35, Issue 11, Pages 2102-2113

Citations: 0

High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453289

Shiyu Shen;Hao Yang;Wangchen Dai;Hong Zhang;Zhe Liu;Yunlei Zhao

Digital signatures are fundamental building blocks in various protocols to provide integrity and authenticity. The development of quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography and has been selected as the primary algorithm for standardization by the National Institute of Standards and Technology. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and IO latency, address bank conflicts, and mitigate pipeline stalls. This results in high and balanced compute throughput and memory throughput for each operation. In terms of concurrent task processing, we leverage task-level batching to fully utilize parallelism and implement a memory pool mechanism for rapid memory access. We propose a dynamic task scheduling mechanism to improve multiprocessor occupancy and significantly reduce execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latencies and maximize the computing capabilities of both CPU and GPU. Across all three security levels, our GPU implementation achieves over 160× speedups for signing and over 80× speedups for verification on both commercial and server-grade GPUs. This achieves microsecond-level amortized execution times for each task, offering a high-throughput and quantum-resistant solution suitable for a wide array of applications in real systems.

Volume 35, Issue 11, Pages 1964-1976

Citations: 0

SC-CGRA: An Energy-Efficient CGRA Using Stochastic Computing

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453310

Di Mou;Bo Wang;Dajiang Liu

Stochastic Computing (SC) offers a promising computing paradigm for low-power and cost-effective applications, with the added advantage of high error tolerance. In parallel, Coarse-Grained Reconfigurable Arrays (CGRA) prove to be a highly promising platform for domain-specific applications due to their combination of energy efficiency and flexibility. Intuitively, introducing SC to CGRA would significantly reinforce the strengths of both paradigms. However, existing SC-based architectures often encounter inherent computation errors, while the stochastic number generators employed in SC result in exponentially growing latency, which is deemed unacceptable in CGRA. In this work, we propose an SC-based CGRA by replacing the exact multiplication in traditional CGRA with an SC-based multiplication. To improve the accuracy of SC and shorten the latency of Stochastic Number Generators (SNG), we introduce leading zero shifting and comparator truncation, while keeping the length of the bitstream fixed. In addition, exploiting the flexible interconnections among PEs, we propose a quality scaling strategy that combines neighboring PEs to achieve high-accuracy operations without switching costs like power-gating. Compared to the state-of-the-art approximate computing design of CGRA, our proposed CGRA achieves, on average, a 65.3% reduction in output error together with a 21.2% reduction in energy consumption and a noteworthy 28.37% area savings.
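For readers unfamiliar with SC-based multiplication, the sketch below shows the textbook unipolar scheme: each operand in [0, 1] is encoded as a random bitstream by a comparator-style stochastic number generator, and a bitwise AND of two independent streams estimates the product. This is a software illustration under those assumptions only; it does not model the paper's leading zero shifting or comparator truncation, and the names `to_bitstream` and `sc_multiply` are hypothetical.

```python
import random

def to_bitstream(value, length, rng):
    """Comparator-style SNG: emit 1 whenever a fresh random number is below `value`."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def sc_multiply(a, b, length=4096, seed=0):
    """Estimate a * b for a, b in [0, 1] by ANDing two independent bitstreams."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length  # fraction of 1s approximates the product

exact = 0.5 * 0.25
estimate = sc_multiply(0.5, 0.25)
print(f"exact = {exact:.4f}, stochastic estimate = {estimate:.4f}")
```

The estimate's error shrinks only as the bitstream grows, which is the accuracy and latency tension that the abstract refers to.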

Volume 35, Issue 11, Pages 2023-2038

Citations: 0

Towards Efficient Graph Processing in Geo-Distributed Data Centers

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453872

Feng Yao;Qian Tao;Shengyuan Lin;Yanfeng Zhang;Wenyuan Yu;Shufeng Gong;Qiange Wang;Ge Yu;Jingren Zhou

Iterative graph processing is widely used as a significant paradigm for large-scale data analysis. In many global businesses of multinational enterprises, graph-structured data is usually geographically distributed across different regions to support low-latency services. Geo-distributed graph processing suffers from Wide Area Networks (WANs) with scarce and heterogeneous bandwidth, and thus differs fundamentally from traditional distributed graph processing. In this paper, we propose RAGraph, a Region-Aware framework for geo-distributed graph processing. At the core of RAGraph, we design a region-aware graph processing framework that allows advancing inefficient global updates locally and enables sensible coordination-free message interactions and a flexible, replaceable communication module. In terms of graph data preprocessing, RAGraph introduces a contribution-driven edge migration algorithm to effectively utilize network resources. RAGraph also contains an adaptive hierarchical message interaction engine to switch interaction modes adaptively based on network heterogeneity and fluctuation, and a discrepancy-aware message filtering strategy to filter important messages. Experimental results show that RAGraph can achieve an average speedup of 9.7× (up to 98×) and an average WAN cost reduction of 78.5% (up to 97.3%) compared with state-of-the-art systems.

Volume 35, Issue 11, Pages 2147-2160

Citations: 0

ComboFunc: Joint Resource Combination and Container Placement for Serverless Function Scaling With Heterogeneous Container

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3454071

Zhaojie Wen;Qiong Chen;Quanfeng Deng;Yipei Niu;Zhen Song;Fangming Liu

Serverless computing provides developers with a maintenance-free approach to resource usage, but it also transfers resource management responsibility to the cloud platform. However, the fine granularity of serverless function resources can lead to performance bottlenecks and resource fragmentation on nodes when creating many function containers. This poses challenges in effectively scaling function resources and optimizing node resource allocation, hindering overall agility. To address these challenges, we have introduced ComboFunc, an innovative resource scaling system for serverless platforms. ComboFunc associates a function with heterogeneous containers of varying specifications and optimizes their resource combination and placement. This approach not only selects appropriate nodes for container creation, but also leverages the new Kubernetes feature In-place Pod Vertical Scaling to enhance resource scaling agility and efficiency. By allowing a single function to correspond to heterogeneous containers with varying resource specifications and providing the ability to modify the resource specifications of existing containers in place, ComboFunc effectively utilizes fragmented resources on nodes. This, in turn, enhances the overall resource utilization of the entire cluster and improves scaling agility. We also model the problem of combining and placing heterogeneous containers as an NP-hard problem and design a heuristic solution based on a greedy algorithm that solves it in polynomial time. We implemented a prototype of ComboFunc on the Kubernetes platform and conducted experiments using real traces on a local cluster. The results demonstrate that, compared to existing strategies, ComboFunc achieves up to 3.01× faster function resource scaling and reduces resource costs by up to 42.6%.

Volume 35, Issue 11, Pages 1989-2005

Citations: 0

CODE$^{+}$: Fast and Accurate Inference for Compact Distributed IoT Data Collection

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-03 DOI: 10.1109/TPDS.2024.3453607

Huali Lu;Feng Lyu;Ju Ren;Huaqing Wu;Conghao Zhou;Zhongyuan Liu;Yaoxue Zhang;Xuemin Shen

In distributed IoT data systems, full-size data collection is impractical due to the energy constraints and large system scales. Our previous work has investigated the advantages of integrating matrix sampling and inference for compact distributed IoT data collection, to minimize the data collection cost while guaranteeing the data benefits. This paper further advances the technology by boosting fast and accurate inference for those distributed IoT data systems that are sensitive to computation time, training stability, and inference accuracy. Particularly, we propose CODE$^{+}$, i.e., Compact Distributed IOT Data CollEction Plus, which features a cluster-based sampling module and a Convolutional Neural Network (CNN)-Transformer Autoencoders-based inference module, to reduce cost and guarantee the data benefits. The sampling component employs a cluster-based matrix sampling approach, in which data clustering is first conducted and then a two-step sampling is performed in accordance with the number of clusters and clustering errors. The inference component integrates a CNN-Transformer Autoencoders-based matrix inference model to estimate the full-size spatio-temporal data matrix, which consists of a CNN-Transformer encoder that extracts the underlying features from the sampled data matrix and a lightweight decoder that maps the learned latent features back to the original full-size data matrix. We implement CODE$^{+}$ under three operational large-scale IoT systems and one synthetic Gaussian distribution dataset, and extensive experiments are provided to demonstrate its efficiency and robustness. With a 20% sampling ratio, CODE$^{+}$ achieves an average data reconstruction accuracy of 94% across four datasets, outperforming our previous version (87%) and the state-of-the-art baseline (71%).

Volume 35, Issue 11, Pages 2006-2022

Citations: 0

Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-08-30 DOI: 10.1109/TPDS.2024.3452478

Hua Huang;Edmond Chow

We consider the distributed memory parallel multiplication of a sparse matrix by a dense matrix (SpMM). The dense matrix is often a collection of dense vectors. Standard implementations will multiply the sparse matrix by multiple dense vectors at the same time, to exploit the computational efficiencies therein. But such approaches generally utilize the same sparse matrix partitioning as if multiplying by a single vector. This article explores the design space of parallelizing SpMM and shows that a coarser grain partitioning of the matrix combined with a column-wise partitioning of the block of vectors can often require less communication volume and achieve higher SpMM performance. An algorithm is presented that chooses a process grid geometry for a given number of processes to optimize the performance of parallel SpMM. The algorithm can augment existing graph partitioners by utilizing the additional concurrency available when multiplying by multiple dense vectors to further reduce communication.
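The trade-off described above can be sketched in a few lines. The example below simulates a hypothetical `pr x pc` process grid in which process (i, j) holds row block i of the matrix and column block j of the vector block, and counts how many remote entries of the vector block each process would have to receive. It assumes a square matrix whose rows are split into contiguous blocks and a vector block whose rows are distributed the same way; a dense NumPy array stands in for the sparse matrix. The function names and the volume model are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def spmm_on_grid(A, X, pr, pc):
    """Compute Y = A @ X block by block, as a pr x pc process grid would."""
    row_blocks = np.array_split(np.arange(A.shape[0]), pr)
    col_blocks = np.array_split(np.arange(X.shape[1]), pc)
    Y = np.zeros((A.shape[0], X.shape[1]))
    for rows in row_blocks:
        for cols in col_blocks:
            # Work of process (i, j): its rows of A times its columns of X.
            Y[np.ix_(rows, cols)] = A[rows, :] @ X[:, cols]
    return Y

def comm_volume(A, n_vectors, pr, pc):
    """Count entries of X that processes must fetch from other row blocks."""
    row_blocks = np.array_split(np.arange(A.shape[0]), pr)
    col_sizes = [len(c) for c in np.array_split(np.arange(n_vectors), pc)]
    volume = 0
    for rows in row_blocks:
        needed = np.flatnonzero(np.any(A[rows, :] != 0, axis=0))
        remote = np.setdiff1d(needed, rows)  # rows of X owned elsewhere
        for width in col_sizes:
            volume += len(remote) * width
    return volume

# Tridiagonal 8 x 8 test matrix and a block of 4 dense vectors, on 4 processes.
n = 8
A = np.diag(np.full(n, 2.0)) + np.diag(np.full(n - 1, -1.0), 1) + np.diag(np.full(n - 1, -1.0), -1)
X = np.arange(n * 4, dtype=float).reshape(n, 4)

assert np.allclose(spmm_on_grid(A, X, 2, 2), A @ X)
print(comm_volume(A, 4, pr=4, pc=1))  # 1D row partitioning: 24
print(comm_volume(A, 4, pr=2, pc=2))  # coarser rows, split vectors: 8
```

With the same four processes, the 2 x 2 grid moves a third of the data of the 4 x 1 grid in this toy case, because fewer row-block boundaries cut through the matrix's nonzero pattern.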

Citations: 0

Beyond Belady to Attain a Seemingly Unattainable Byte Miss Ratio for Content Delivery Networks

IF 5.6 Zone 2 Computer Science Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-08-30 DOI: 10.1109/TPDS.2024.3452096

Peng Wang;Hong Jiang;Yu Liu;Zhelong Zhao;Ke Zhou;Zhihai Huang

Reducing the byte miss ratio (BMR) in the Content Delivery Network (CDN) caches can help providers save on the cost of paying for traffic. When evicting objects or files of different sizes in the caches of CDNs, it is no longer sufficient to pursue an optimal object miss ratio (OMR) by approximating Belady to ensure an optimal BMR. Our experimental observations suggest that there are multiple request sequence windows. In these windows, a replacement policy prioritizes the eviction of objects with large sizes and ultimately evicts the object with the longest reuse distance, lowering the BMR without increasing the OMR. To accurately capture those windows, we monitor the changes in OMR and BMR using a deep reinforcement learning (RL) model and then implement a BMR-friendly replacement algorithm in these windows. Based on this policy, we propose a Belady and Size Eviction (LRU-BaSE) algorithm that reduces BMR while maintaining OMR. To make LRU-BaSE efficient and practical, we address the feedback delay problem of RL with a two-pronged approach. On the one hand, we shorten the LRU-base decision region based on the observation that the rear section of the cache queue contains most of the eviction candidates. On the other hand, the request distribution on CDNs makes it feasible to divide the learning region into multiple sub-regions that are each learned with reduced time and increased accuracy. In real CDN systems, LRU-BaSE outperforms LRU by reducing “backing to OS” traffic and access latency by 30.05% and 17.07%, respectively, on average. In simulator tests, LRU-BaSE outperforms state-of-the-art cache replacement policies. On average, LRU-BaSE's BMR is 0.63% and 0.33% less than that of Belady and Practical Flow-based Offline Optimal (PFOO), respectively. In addition, compared to Learning Relaxed Belady (LRB), LRU-BaSE can yield relatively stable performance when facing workload drift.
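
To make the two metrics in this abstract concrete, the following minimal simulator computes the object miss ratio (OMR, missed requests over all requests) and the byte miss ratio (BMR, missed bytes over all requested bytes) for a byte-capacity cache. It is only a sketch: the function names, the admission rule for oversized objects, and the two policies shown (plain LRU and a farthest-next-use rule in the spirit of Belady) are assumptions, and it does not implement LRU-BaSE.

```python
from collections import OrderedDict


def next_use(trace, t, obj):
    """Index of the next request for obj after position t, or infinity."""
    for u in range(t + 1, len(trace)):
        if trace[u] == obj:
            return u
    return float("inf")


def miss_ratios(trace, sizes, capacity, policy="lru"):
    """Return (OMR, BMR) for a cache holding at most `capacity` bytes.

    trace: list of object ids; sizes: dict mapping id -> size in bytes.
    """
    cache = OrderedDict()                     # id -> size, least recently used first
    used = 0
    miss_objs = miss_bytes = total_bytes = 0
    for t, obj in enumerate(trace):
        size = sizes[obj]
        total_bytes += size
        if obj in cache:                      # hit: refresh recency
            cache.move_to_end(obj)
            continue
        miss_objs += 1
        miss_bytes += size
        if size > capacity:                   # assumption: never admit oversized objects
            continue
        while used + size > capacity:
            if policy == "lru":
                victim = next(iter(cache))
            else:                             # "belady": farthest next use
                victim = max(cache, key=lambda o: next_use(trace, t, o))
            used -= cache.pop(victim)
        cache[obj] = size
        used += size
    return miss_objs / len(trace), miss_bytes / total_bytes


# Hypothetical trace: two small objects and one large object.
sizes = {"a": 1, "b": 1, "c": 8}
omr, bmr = miss_ratios(["a", "b", "c", "a", "b", "c"], sizes, capacity=2)
print(omr, bmr)
```

With objects of unequal size the two ratios diverge: in the example trace four of six requests miss, but those misses account for 18 of the 20 requested bytes.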

Volume 35, Issue 11, pp. 1949-1963

Citations: 0

BIRD+: Design of a Lightweight Communication Compressor for Resource-Constrained Distribution Learning Platforms

IF 5.6 Zone 2 Computer Science Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-08-21 DOI: 10.1109/TPDS.2024.3447221

Donglei Wu;Weihao Yang;Xiangyu Zou;Hao Feng;Dingwen Tao;Shiyi Li;Wen Xia;Binxing Fang

The Top-K sparsification-based compression framework is extensively explored for reducing communication costs in distributed learning. However, we identified several issues with existing Top-K sparsification-based compression methods: (i) The limited compressibility of the Top-K parameter's indexes critically restricts the overall communication compression ratio; (ii) Several time-consuming compression operations significantly offset the benefits of communication compression; (iii) The use of error feedback techniques to maintain model quality results in a high memory footprint. To solve these issues, we propose BIRD, a lightweight tensor-wise Bi-Random sampling strategy with an expectation invariance property. Specifically, BIRD applies a tensor-wise index sharing mechanism that reduces the index proportion by allowing multiple tensor elements to share a single index, thus improving the overall compression ratio. Additionally, BIRD replaces the time-consuming Top-K sorting with a faster Bi-Random sampling strategy based on the aforementioned index sharing mechanism, significantly reducing compression overheads. Moreover, BIRD builds an expectation invariance property into the Bi-Random sampling to ensure an approximately unbiased representation of the $L_1$-norm of the sampled tensors, effectively maintaining the model quality without incurring extra memory costs. We further optimize BIRD to BIRD+ by introducing uniform distribution-based sampling and Gamma correction into the tensor-wise sampling process, achieving a more flexible adjustment of the sparsity with better convergence performance. Experimental evaluations across multiple conventional distributed learning tasks demonstrate that, compared to state-of-the-art approaches, BIRD+ achieves higher communication compression ratios, up to 36.2$\times$, and higher computation throughput, up to 149.6$\times$, while maintaining the model quality without incurring extra memory costs.
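
The ideas of index sharing and expectation invariance can be illustrated with a much simpler scheme than BIRD itself. The sketch below is an assumption-laden stand-in, not the paper's Bi-Random sampling: it keeps every `stride`-th element starting from one random offset, so a single transmitted index (the offset) describes all kept elements, and it rescales the kept values by `stride` so that the expected $L_1$-norm of the sample equals that of the original tensor. All names are hypothetical.

```python
import random


def sample(tensor, stride, offset):
    """Keep every stride-th element from `offset`, rescaled by `stride`.

    Each element is kept for exactly one of the `stride` possible offsets,
    so with a uniformly random offset the rescaled sample is unbiased.
    """
    return [stride * v for v in tensor[offset::stride]]


def compress(tensor, stride, rng=random):
    """Return (offset, values): one shared index plus the kept values."""
    offset = rng.randrange(stride)
    return offset, sample(tensor, stride, offset)


def decompress(offset, values, stride, length):
    """Scatter the kept values back into a zero-filled list of `length`."""
    out = [0.0] * length
    out[offset::stride] = values
    return out


# Hypothetical gradient tensor, flattened to a list.
x = [1.0, -2.0, 3.0, -4.0, 5.0]
offset, values = compress(x, stride=2)
print(decompress(offset, values, 2, len(x)))
```

Averaging the $L_1$-norm of the sample over both possible offsets gives (18 + 12) / 2 = 15, which matches the $L_1$-norm of `x`. The payload is about one value per `stride` elements plus a single offset, instead of one index per kept value as in Top-K.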

Volume 35, Issue 11, pp. 2193-2207

Citations: 0

Fair Coflow Scheduling via Controlled Slowdown

IF 5.6 Zone 2 Computer Science Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-08-20 DOI: 10.1109/TPDS.2024.3446188

Francesco De Pellegrini;Vaibhav Kumar Gupta;Rachid El Azouzi;Serigne Gueye;Cedric Richier;Jeremie Leguay

The average coflow completion time (CCT) is the standard performance metric in coflow scheduling. However, standard CCT minimization may introduce unfairness between the data transfer phases of different computing jobs. Thus, while progress guarantees have been introduced in the literature to mitigate this fairness issue, the trade-off between fairness and efficiency of data transfer is hard to control. This paper introduces a fairness framework for coflow scheduling based on the concept of slowdown, i.e., the performance loss of a coflow compared to isolation. By controlling the slowdown, it is possible to enforce a target coflow progress while minimizing the average CCT. In the proposed framework, the minimum slowdown for a batch of coflows can be determined in polynomial time. By showing the equivalence with Gaussian elimination, slowdown constraints are introduced into primal-dual iterations of the CoFair algorithm. The algorithm extends the class of the $\sigma$-order schedulers to solve the fair coflow scheduling problem in polynomial time. It provides a 4-approximation of the average CCT w.r.t. an optimal scheduler. Extensive numerical results demonstrate that this approach can trade off average CCT for slowdown more efficiently than existing state-of-the-art schedulers.
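
The slowdown notion used in this abstract can be made concrete with a small calculation. The sketch below is not CoFair: it assumes a simplified model in which every port is an independent unit-rate resource that serves coflows in strict priority order, so a coflow finishes on a port once all higher-priority load on that port has drained. The function names and the example coflows are hypothetical.

```python
def isolation_cct(coflow):
    """CCT when the coflow runs alone: its most heavily loaded port."""
    return max(coflow.values())


def schedule(coflows, order):
    """Return {name: (cct, slowdown)} for a strict priority order.

    coflows: dict name -> {port: load}, with unit-rate ports (an assumption).
    slowdown = CCT under sharing / CCT in isolation.
    """
    busy = {}                                  # port -> cumulative load so far
    result = {}
    for name in order:
        finish = 0.0
        for port, load in coflows[name].items():
            busy[port] = busy.get(port, 0.0) + load
            finish = max(finish, busy[port])   # a coflow ends when its last port ends
        result[name] = (finish, finish / isolation_cct(coflows[name]))
    return result


def summarize(coflows, order):
    """Return (average CCT, maximum slowdown) for the given order."""
    res = schedule(coflows, order)
    ccts = [cct for cct, _ in res.values()]
    return sum(ccts) / len(ccts), max(s for _, s in res.values())


# Hypothetical batch: A is large on one port, B is small on two ports.
coflows = {"A": {"p1": 4.0}, "B": {"p1": 1.0, "p2": 2.0}}
print(summarize(coflows, ["B", "A"]))
print(summarize(coflows, ["A", "B"]))
```

In this example, serving B first gives an average CCT of 3.5 with a maximum slowdown of 1.25, while serving A first gives 4.5 and 2.5. Here one order wins on both metrics. In general the two objectives can conflict, which is the trade-off the paper's framework aims to control.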

Volume 35, Issue 12, pp. 2347-2360

Citations: 0