
IEEE Transactions on Parallel and Distributed Systems: Latest Articles

Privacy-Preserving Data Selection for Horizontal and Vertical Federated Learning
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-19 | DOI: 10.1109/TPDS.2024.3439709
Lan Zhang;Anran Li;Hongyi Peng;Feng Han;Fan Huang;Xiang-Yang Li
Federated learning (FL) enables distributed participants to collaboratively train a machine learning model without accessing their local data. In FL systems, the selection of training samples has a significant impact on model performance; for example, selecting participants whose datasets contain low-quality samples and features results in low-accuracy, unstable models. In this work, we address the problem of selecting a collection of high-quality training samples for a given FL task under a monetary budget. We propose a holistic design that efficiently selects high-quality samples while preserving the privacy of both the participants' local data and the server's label set. For horizontal federated learning (HFL), we propose an efficient hierarchical sample selection mechanism that selects relevant clients and their samples before training. It uses a determinantal point process (DPP) to select clients and samples that are both statistically homogeneous and diverse in content. For vertical federated learning (VFL), we propose a private set intersection (PSI) based scheme to filter relevant features for the target task. Finally, during training, an erroneous-aware importance-based selection dynamically selects important clients and samples to accelerate model convergence. We verify the merits of the proposed solution with extensive experiments on a real AIoT system with 50 clients. The experimental results show that our solution achieves accurate and efficient selection of high-quality data and, consequently, an FL model with faster convergence and higher accuracy.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 2054-2068. Journal Article.
Citations: 0
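The abstract above names a determinantal point process (DPP) as the selection tool but gives no algorithm. The following is only a minimal sketch of one common way to use a DPP for subset selection: greedy maximization of the determinant of a quality-weighted similarity kernel. The functions `build_kernel` and `greedy_dpp`, the cosine-similarity kernel, and the item-count budget (the paper speaks of a monetary budget) are all assumptions made for illustration, not the authors' method, and no privacy protection is modeled.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def build_kernel(features, quality):
    """Hypothetical DPP kernel: L[i][j] = q_i * cos(f_i, f_j) * q_j."""
    n = len(features)
    return [[quality[i] * cosine(features[i], features[j]) * quality[j]
             for j in range(n)] for i in range(n)]


def greedy_dpp(kernel, budget):
    """Greedily add the item that maximizes det of the selected submatrix."""
    selected = []
    remaining = set(range(len(kernel)))
    while remaining and len(selected) < budget:
        best, best_det = None, 0.0
        for i in sorted(remaining):
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = det(sub)
            if d > best_det + 1e-12:
                best, best_det = i, d
        if best is None:  # every candidate is (numerically) redundant
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Illustrative data: items 0 and 1 are near-duplicates, item 2 is different.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
quality = [1.0, 0.95, 0.9]
print(greedy_dpp(build_kernel(features, quality), budget=2))
```

The determinant shrinks when two chosen items are nearly parallel, so the greedy step skips the near-duplicate item 1 in favor of the lower-quality but dissimilar item 2.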
Logical Synchrony and the Bittide Mechanism
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-16 | DOI: 10.1109/TPDS.2024.3444739
Sanjay Lall;Călin Caşcaval;Martin Izzard;Tammo Spalink
We introduce logical synchrony, a framework that allows distributed computing to be coordinated as tightly as in synchronous systems without the distribution of a global clock or any reference to universal time. We develop a model of events called a logical synchrony network, in which nodes correspond to processors and every node has an associated local clock which generates the events. We construct a measure of logical latency and develop its properties. A further model, called a multiclock network, is then analyzed and shown to be a refinement of the logical synchrony network. We present the bittide mechanism as an instantiation of multiclock networks, and discuss the clock control mechanism that ensures that buffers do not overflow or underflow. Finally we give conditions under which a logical synchrony network has an equivalent synchronous realization.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 1936-1948. Journal Article. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10638228
Citations: 0
Paired Many-to-Many 2-Disjoint Path Covers in Meshes
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-16 | DOI: 10.1109/TPDS.2024.3445283
Fatemeh Keshavarz-Kohjerdi
In the paired many-to-many $k$-disjoint path cover ($k$-DPC) problem, given a set of $k$ pairs of vertices $(s_{i},t_{i})$, $1 \leqslant i \leqslant k$, of a graph $G$, we want to find $k$ simple vertex-disjoint paths whose end-vertices are these $k$ pairs, such that each vertex of $G$ is covered by a path. This is a well-known problem in parallel processing and a generalization of the well-known Hamiltonian $(s,t)$-path problem, which is equivalent to 1-DPC. In this paper, we consider the paired many-to-many 2-disjoint path cover problem (2-DPC) in meshes (rectangular grids). We give the necessary conditions for the existence of such covers and present a linear-time algorithm to compute them. Although the paired many-to-many $k$-disjoint path cover problem is well known in parallel processing, our motivation for studying it is its application to solving the Hamiltonian path problem in solid grid graphs. We consider the case where the pairs of vertices are on the outer face of the graph.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1854-1866. Journal Article.
Citations: 0
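The definition in the abstract above (paths with prescribed end-vertices that are vertex-disjoint and together cover every vertex) can be checked mechanically. The sketch below is a verifier only, not the paper's linear-time construction; the function name `is_paired_dpc` and the `(row, col)` vertex encoding are assumptions chosen for illustration.

```python
def is_paired_dpc(rows, cols, paths, pairs):
    """Return True if `paths` is a paired disjoint path cover of a rows x cols mesh.

    paths: list of vertex lists, each vertex a (row, col) tuple.
    pairs: list of (s_i, t_i) end-vertex pairs, in the same order as `paths`.
    """
    if len(paths) != len(pairs):
        return False
    seen = set()
    for path, (s, t) in zip(paths, pairs):
        # Each path must start at s_i and end at t_i.
        if not path or path[0] != s or path[-1] != t:
            return False
        # Consecutive vertices must be mesh neighbors (Manhattan distance 1).
        for a, b in zip(path, path[1:]):
            if abs(a[0] - b[0]) + abs(a[1] - b[1]) != 1:
                return False
        # Vertices must lie in the mesh and be used at most once overall.
        for r, c in path:
            if not (0 <= r < rows and 0 <= c < cols) or (r, c) in seen:
                return False
            seen.add((r, c))
    # Every vertex of the mesh must be covered.
    return len(seen) == rows * cols


# A 2 x 3 mesh covered by its two rows, with all terminals on the outer face.
top = [(0, 0), (0, 1), (0, 2)]
bottom = [(1, 0), (1, 1), (1, 2)]
pairs = [((0, 0), (0, 2)), ((1, 0), (1, 2))]
print(is_paired_dpc(2, 3, [top, bottom], pairs))
```

With a single pair, the same check tests for a Hamiltonian $(s,t)$-path, which matches the abstract's remark about 1-DPC.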
FlexRaft: Exploiting Flexible Erasure Coding for Minimum-Cost Consensus and Fast Recovery
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-14 | DOI: 10.1109/TPDS.2024.3443424
Mi Zhang;Qihan Kang;Patrick P. C. Lee
Consensus protocols like Paxos and Raft provide data consistency and fault tolerance for distributed services. Log replication in these protocols can be supported by erasure coding, which incurs lower redundancy than full-copy replication and significantly saves network and storage costs for overall performance improvements. However, existing consensus protocols with erasure coding cannot achieve the minimum network and storage costs during log replication. We propose FlexRaft, which dynamically varies the coding scheme used in Raft based on the server status to always achieve the theoretically minimum redundancy ratio, while maintaining the same liveness as in Raft. To address the issue of an inconsistent coding scheme between the leader and its followers, we specify the prerequisite of overwriting a log entry and also allow the leader and its followers to exactly track the coding scheme being used. We further extend FlexRaft into FlexRaft+, which provides a different storage layout to vary the coding scheme through a novel technique called re-encoding-free replication, so as to enable fast server recovery. We prove that both FlexRaft and FlexRaft+ maintain Raft safety. We implement a prototype of FlexRaft and FlexRaft+, atop which we build a distributed key-value store to show its efficacy. Experiments on Alibaba Cloud show that FlexRaft achieves the theoretically minimum network and storage costs in practice, and reduces the commit latency by 44.51% and 19.37% compared with state-of-the-art CRaft and HRaft, respectively. FlexRaft+ further reduces the commit latency when the coding scheme is being varied and improves the server recovery performance.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1826-1840. Journal Article.
Citations: 0
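The cost argument in the abstract above rests on a general property of erasure coding: splitting an entry into k data fragments plus m parity fragments stores (k + m) / k times the original size, against N times for full-copy replication on N servers. The arithmetic can be sketched as follows. How FlexRaft chooses the coding parameters from server status is not stated in the abstract, so the (k, m) values and function names here are purely illustrative assumptions.

```python
from fractions import Fraction


def replication_cost(entry_bytes, num_servers):
    """Bytes sent and stored when every server receives a full copy."""
    return entry_bytes * num_servers


def erasure_cost(entry_bytes, k, m):
    """Bytes sent and stored when each of k + m servers receives one fragment.

    The entry is split into k equal data fragments (rounded up), and m parity
    fragments of the same size are added.
    """
    fragment_bytes = -(-entry_bytes // k)  # ceiling division
    return fragment_bytes * (k + m)


def redundancy_ratio(k, m):
    """Stored size relative to the original entry for a (k, m) code."""
    return Fraction(k + m, k)


# Illustrative setting: 5 servers, a 3000-byte log entry, a (3, 2) code.
print(replication_cost(3000, 5))
print(erasure_cost(3000, 3, 2))
print(redundancy_ratio(3, 2))
```

In this hypothetical setting the erasure-coded entry costs one third of what full replication does, while any 3 of the 5 fragments suffice to rebuild it under an MDS code.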
SSRAID: A Stripe-Queued and Stripe-Threaded Merging I/O Strategy to Improve Write Performance of Serial Interface SSD RAID
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-14 | DOI: 10.1109/TPDS.2024.3443083
Peixuan Li;Ping Xie;Qiang Cao
RAID (Redundant Array of Independent Disks) has been widely used to enhance the read and write performance of storage systems. Existing software RAID implementations do not fully utilize the write performance of serial-interface SSDs (solid-state drives). The most popular software RAID is currently Linux Multiple-Disks (MD), and the most recent is StRAID. We observe that both lead to thread contention in multi-threaded mode, especially when applied to serial-interface SSDs. Multiple threads writing to the same address can limit write performance. In this paper, we propose SSRAID, a stripe-queued and stripe-threaded merging I/O strategy. First, SSRAID segregates write requests across different stripes using a set of stripe-queues and stripe-threads to prevent interference between them. As a result, write thread contention in SSRAID is eliminated, allowing stripe-threads to maintain the highest efficiency of parallelism. Second, SSRAID can merge write requests from the same stripe-queue multiple times through its stripe-thread, effectively reducing the number of additional write I/Os. Finally, SSRAID introduces a stage buffer based on data merging. During a partial stripe write, write-induced read I/Os on the SSD are transformed into direct accesses to the stage buffer, effectively reducing write-induced read I/Os. Compared to StRAID, SSRAID improves average sequential write throughput by 86% and reduces average sequential write latency by 61% in the optimal case.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1841-1853. Journal Article.
Citations: 0
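The first two ideas in the abstract above (routing writes to per-stripe queues, then merging writes within a queue) can be illustrated with a small single-threaded sketch. The layout formula, the function names, and the `(offset, length)` request format are assumptions for illustration; the real system serves each queue with its own thread and also handles parity and the stage buffer, none of which is modeled here.

```python
from collections import defaultdict


def enqueue_by_stripe(requests, chunk_size, data_disks):
    """Split (offset, length) writes at stripe boundaries and queue them by stripe.

    Assumes a stripe holds chunk_size * data_disks bytes of user data.
    """
    stripe_bytes = chunk_size * data_disks
    queues = defaultdict(list)
    for offset, length in requests:
        end = offset + length
        while offset < end:
            stripe = offset // stripe_bytes
            piece_end = min(end, (stripe + 1) * stripe_bytes)
            queues[stripe].append((offset, piece_end - offset))
            offset = piece_end
    return queues


def merge_queue(queue):
    """Merge overlapping or adjacent writes of one stripe queue."""
    merged = []
    for offset, length in sorted(queue):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            new_end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, new_end - last_offset)
        else:
            merged.append((offset, length))
    return merged


# Illustrative layout: 4-byte chunks on 2 data disks, so 8 bytes per stripe.
requests = [(0, 4), (4, 4), (6, 6), (20, 2)]
queues = enqueue_by_stripe(requests, chunk_size=4, data_disks=2)
for stripe in sorted(queues):
    print(stripe, merge_queue(queues[stripe]))
```

Because no request piece crosses a stripe boundary, each queue can be drained independently, which is the property that lets one worker per queue avoid contending with the others.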
Proteus: Simulating the Performance of Distributed DNN Training
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-14 | DOI: 10.1109/TPDS.2024.3443255
Jiangfei Duan;Xiuhong Li;Ping Xu;Xingcheng Zhang;Shengen Yan;Yun Liang;Dahua Lin
DNN models are becoming increasingly larger to achieve unprecedented accuracy, and the accompanying increased computation and memory requirements necessitate the employment of massive clusters and elaborate parallelization strategies to accelerate DNN training. In order to better optimize the performance and analyze the cost, it is indispensable to model the training throughput of distributed DNN training. However, complex parallelization strategies and the resulting complex runtime behaviors make it challenging to construct an accurate performance model. In this article, we present Proteus, the first standalone simulator to model the performance of complex parallelization strategies through simulation execution. Proteus first models complex parallelization strategies with a unified representation named Strategy Tree. Then, it compiles the strategy tree into a distributed execution graph and simulates the complex runtime behaviors, comp-comm overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). We finally evaluate Proteus across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves 3.0% average prediction error and preserves order for training throughput of various parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1867-1878. Journal Article. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10636756
Citations: 0
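The abstract above reports two evaluation criteria for a throughput simulator: average prediction error and whether the ranking of parallelization strategies is preserved. A minimal sketch of how such criteria could be computed is given below. The exact metric definitions used in the paper are not stated in the abstract, so mean relative error and exact rank agreement are assumptions, and the throughput numbers are invented for illustration only.

```python
def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over all strategies."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)


def preserves_order(predicted, measured):
    """True if sorting strategies by predicted and by measured throughput agrees."""
    def ranking(values):
        return sorted(range(len(values)), key=lambda i: values[i])
    return ranking(predicted) == ranking(measured)


# Invented throughputs (samples/s) for three hypothetical strategies.
measured = [100.0, 150.0, 120.0]
predicted = [98.0, 153.0, 121.0]
print(round(mean_relative_error(predicted, measured), 4))
print(preserves_order(predicted, measured))
```

Order preservation matters separately from error: a simulator can be used to pick the best strategy as long as the ranking is right, even when absolute predictions are slightly off.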
Opca: Enabling Optimistic Concurrent Access for Multiple Users in Oblivious Data Storage
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-12 | DOI: 10.1109/TPDS.2024.3441623
Yuezhi Che;Dazhao Cheng;Xiao Wang;Rujia Wang
The challenges of data privacy and security posed by data outsourcing are becoming increasingly prevalent. Oblivious RAM (ORAM)-based oblivious data storage guarantees data confidentiality through data encryption and access pattern obfuscation. However, it suffers from performance degradation and low throughput. To address these issues, the concurrency of ORAM in a multi-user scenario has been explored. We investigate several existing concurrent oblivious data storage solutions and find that they use a trusted proxy to serve concurrent accesses between users and storage, with processing locks in the proxy to ensure correctness and prevent conflicts. Such a proxy-based system is inherently prone to pessimistic concurrency control, and as the number of users grows, the proxy can become a performance bottleneck, causing significant delays. In this study, we propose Opca, a novel oblivious data storage framework that enables optimistic concurrent access. Opca refines the proxy design by temporally storing multiple versions of modified data with labeled timestamps, committing only the latest version to the storage during a separate processing period. Opca is implemented and evaluated on different real-world storage backends with a scalable number of users, and its performance is compared to alternative schemes. Opca outperforms the state-of-the-art concurrent oblivious storage system TaoStore, which relies on a similar system setting. Our results show that Opca improves throughput by 3.77x and reduces response time by 73.5%.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 1891-1903. Journal Article.
Citations: 0
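The proxy behavior described in the abstract above (buffer several timestamped versions of a block, then commit only the newest one in a separate phase) can be sketched in a few lines. This is a toy model under loud assumptions: the `VersionBuffer` class and its methods are hypothetical names, a plain dictionary stands in for the storage backend, and nothing here is oblivious or encrypted, so it shows only the versioning idea and not Opca's actual protocol.

```python
class VersionBuffer:
    """Toy proxy-side buffer that keeps timestamped versions of written blocks."""

    def __init__(self):
        self._versions = {}  # block_id -> list of (timestamp, value)
        self._clock = 0      # logical timestamp, incremented per write

    def write(self, block_id, value):
        """Record a new version without touching storage; return its timestamp."""
        self._clock += 1
        self._versions.setdefault(block_id, []).append((self._clock, value))
        return self._clock

    def read(self, block_id, storage):
        """Return the newest buffered version, or fall back to storage."""
        versions = self._versions.get(block_id)
        if versions:
            return max(versions, key=lambda v: v[0])[1]
        return storage.get(block_id)

    def commit(self, storage):
        """Write only the latest version of each block to storage, then clear."""
        committed = 0
        for block_id, versions in self._versions.items():
            storage[block_id] = max(versions, key=lambda v: v[0])[1]
            committed += 1
        self._versions.clear()
        return committed


storage = {"c": "old"}
buf = VersionBuffer()
buf.write("a", "v1")
buf.write("a", "v2")
buf.write("b", "x")
print(buf.read("a", storage), buf.read("c", storage))
print(buf.commit(storage), storage)
```

Three writes result in only two storage updates, which is the saving the commit phase is meant to provide: intermediate versions never reach the backend.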
Which Coupled is Best Coupled? An Exploration of AIMC Tile Interfaces and Load Balancing for CNNs
IF 5.6 | Zone 2 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-08-02 | DOI: 10.1109/TPDS.2024.3437657
Joshua Klein;Irem Boybat;Giovanni Ansaloni;Marina Zapater;David Atienza
Due to stringent energy and performance constraints, edge AI computing often employs heterogeneous systems that utilize both general-purpose CPUs and accelerators. Analog in-memory computing (AIMC) is a well-known AI inference solution that overcomes computational bottlenecks by performing matrix-vector multiplication operations (MVMs) in constant time. However, the tiles of AIMC-based accelerators are limited by the number of weights they can hold. State-of-the-art research often sizes neural networks to AIMC tiles (or vice-versa), but does not consider cases where AIMC tiles cannot cover the whole network due to lack of tile resources or the network size. In this work, we study the trade-offs of available AIMC tile resources, neural network coverage, AIMC tile proximity to compute resources, and multi-core load balancing techniques. We first perform a study of single-layer performance and energy scalability of AIMC tiles in the two most typical AIMC acceleration targets: dense/fully-connected layers and convolutional layers. This study guides the methodology with which we approach parameter allocation to AIMC tiles in the context of large edge neural networks, both where AIMC tiles are close to the CPU (tightly-coupled) and cannot share resources across the system, and where AIMC tiles are far from the CPU (loosely-coupled) and can employ workload stealing. We explore the performance and energy trends of six modern CNNs using different methods of load balancing for differently-coupled system configurations with variable AIMC tile resources. We show that, by properly distributing workloads, AIMC acceleration can be made highly effective even on under-provisioned systems. As an example, 5.9x speedup and 5.6x energy gains were measured on an 8-core system, for a 41% coverage of neural network parameters.
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1780-1795.
Citations: 0
Locality-Preserving Graph Traversal With Split Live Migration
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-08-02 DOI: 10.1109/TPDS.2024.3436828
Rong Chen;Xingda Wei;Xiating Xie;Haibo Chen
Graphs model many kinds of real-world data, such as social, transportation, biological, and communication data. Hence, graph traversal, including multi-hop and graph-walking queries, has been the key operation atop graph stores. However, since different graph traversals may touch different sets of vertices, it is hard or even impossible to have a one-size-fits-all graph partitioning algorithm that preserves access locality for various graph traversal workloads. Meanwhile, prior shard-based migration faces a dilemma: coarse-grained migration may incur migration overhead that outweighs the locality benefits, while fine-grained migration usually requires excessive metadata and incurs non-trivial maintenance costs. We present Pragh, an efficient locality-preserving live graph migration scheme for graph stores in the form of key-value pairs. The key idea of Pragh is a split migration model that only migrates values physically while retaining keys in the initial location. This allows fine-grained migration while avoiding the need to maintain excessive metadata. Pragh integrates an RDMA-friendly location cache from DrTM-KV to provide fully-localized access to migrated data and further makes a novel reuse of the cache replacement policy for lightweight monitoring. Pragh further supports evolving graphs through a check-and-forward mechanism to resolve the conflict between updates and migration of graph data. Evaluations on an 8-node RDMA-capable cluster (100 Gbps) using a representative graph traversal benchmark show that Pragh can increase the throughput by up to 19× and decrease the median latency by up to 94%, thanks to split live migration that eliminates 97% of remote accesses. A port of split live migration to Wukong shows up to 2.53× throughput improvement on representative workloads like LUBM-10240, thanks to an 88% reduction in remote accesses. This further confirms the effectiveness and generality of Pragh. Finally, though Pragh focuses on RDMA-based graph traversal, we show its generality by extending it to support graph traversals under traditional networking. Evaluations on the graph traversal benchmarks and graph query workloads on the same cluster but with a 10 Gbps TCP/IP network further confirm its effectiveness without RDMA. Specifically, when evaluated on LUBM-10240, Wukong-TCP with Pragh can achieve up to 1.87× throughput improvement with a 56% decrease in remote accesses.
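The split migration idea described in the abstract (keys stay on their home node, values move, a per-node location cache shortcuts lookups, and updates are checked at the key and forwarded to the value) can be sketched in a few lines. This is a toy, single-process illustration under stated assumptions: the `Node` and `SplitStore` classes, the modulo-based home placement, and the way remote accesses are counted are all hypothetical, and nothing here models RDMA or Pragh's actual data structures.

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.keys = {}    # key -> id of the node currently holding the value
        self.values = {}  # key -> value physically stored on this node


class SplitStore:
    """Toy key-value store where keys never move but values can migrate."""

    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def home(self, key):
        # Assumption: integer keys (e.g. vertex ids) placed by modulo.
        return self.nodes[key % len(self.nodes)]

    def put(self, key, value):
        # Check the key entry at its home, then forward the update to
        # whichever node currently holds the value.
        home = self.home(key)
        holder_id = home.keys.setdefault(key, home.node_id)
        self.nodes[holder_id].values[key] = value

    def migrate(self, key, dst_id):
        # Only the value moves; the key entry stays home and is re-pointed.
        home = self.home(key)
        src = self.nodes[home.keys[key]]
        self.nodes[dst_id].values[key] = src.values.pop(key)
        home.keys[key] = dst_id

    def get(self, key, from_id, cache):
        """Read `key` from node `from_id`; return (value, remote_accesses)."""
        remote = 0
        loc = cache.get(key)
        if loc is not None:
            remote += int(loc != from_id)
            if key in self.nodes[loc].values:
                return self.nodes[loc].values[key], remote
        # Cache miss or stale entry: fall back to the key's home node.
        home = self.home(key)
        remote += int(home.node_id != from_id)
        loc = cache[key] = home.keys[key]
        remote += int(loc != from_id and loc != home.node_id)
        return self.nodes[loc].values[key], remote


store = SplitStore(num_nodes=2)
store.put(1, "v")              # key 1 lives on node 1
cache = {}                     # location cache of node 0
print(store.get(1, 0, cache))  # first read goes to the remote home node
store.migrate(1, 0)            # move only the value to node 0
print(store.get(1, 0, cache))  # stale cache entry, then home lookup
print(store.get(1, 0, cache))  # cache now points at the local value
```

In this sketch, once the value has migrated and the cache entry is refreshed, reads from node 0 no longer touch another node, which is the locality effect the abstract attributes to split live migration.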
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 10, pp. 1810-1825.
Citations: 0
Distributed Evolution Strategies With Multi-Level Learning for Large-Scale Black-Box Optimization
IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-08-02 DOI: 10.1109/TPDS.2024.3437688
Qiqi Duan;Chang Shao;Guochen Zhou;Minghan Zhang;Qi Zhao;Yuhui Shi
In the post-Moore era, the main performance gains of black-box optimizers increasingly depend on parallelism, especially for large-scale optimization (LSO). Here we propose to parallelize the well-established covariance matrix adaptation evolution strategy (CMA-ES) and, in particular, one of its latest LSO variants, limited-memory CMA-ES (LM-CMA). To achieve efficiency while approximating its powerful invariance property, we present a multilevel learning-based meta-framework for distributed LM-CMA. Owing to its hierarchically organized structure, Meta-ES is well-suited to implement our distributed meta-framework, wherein the outer-ES controls strategy parameters while all parallel inner-ESs run the serial LM-CMA with different settings. For the distribution mean update of the outer-ES, both the elitist and multi-recombination strategies are used in parallel to avoid regression and stagnation, respectively. To exploit spatiotemporal information, the global step-size adaptation combines Meta-ES with the parallel cumulative step-size adaptation. After each isolation time, our meta-framework employs both structure and parameter learning strategies to combine aligned evolution paths for CMA reconstruction. Experiments on a set of large-scale benchmarking functions with memory-intensive evaluations, arguably reflecting many data-driven optimization problems, validate the benefits (e.g., effectiveness w.r.t. solution quality, and adaptability w.r.t. second-order learning) and costs of our meta-framework.
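The two-level structure described above (an outer ES that controls strategy parameters while several inner ESs run with different settings for one isolation period) can be sketched as follows. This is a heavily simplified illustration, not the paper's framework: a plain (mu/mu, lambda)-ES with a fixed step size stands in for serial LM-CMA, the inner runs execute in a loop rather than on parallel workers, only the elitist outer update is shown, and the names `inner_es` and `meta_es` and all parameter values are hypothetical.

```python
import random


def sphere(x):
    return sum(v * v for v in x)


def inner_es(f, mean, sigma, generations, lam=8, mu=4, rng=random):
    """Plain (mu/mu, lambda)-ES with a fixed step size.

    Stand-in for the serial LM-CMA run of one inner ES (assumption).
    """
    mean = list(mean)
    for _ in range(generations):
        pop = [[m + sigma * rng.gauss(0, 1) for m in mean] for _ in range(lam)]
        pop.sort(key=f)
        # Recombine the mu best samples into the new mean.
        mean = [sum(ind[i] for ind in pop[:mu]) / mu for i in range(len(mean))]
    return mean, f(mean)


def meta_es(f, x0, sigma0, outer_iters=20, isolation=5, seed=0):
    rng = random.Random(seed)
    mean, sigma, best_f = list(x0), sigma0, f(x0)
    for _ in range(outer_iters):
        # Each inner ES would run on its own worker; here they run in turn,
        # each with a different step-size setting.
        results = [(inner_es(f, mean, s, isolation, rng=rng), s)
                   for s in (sigma / 2, sigma, sigma * 2)]
        (cand, cand_f), cand_sigma = min(results, key=lambda r: r[0][1])
        if cand_f < best_f:
            # Elitist outer update: only accept a strictly better mean,
            # and inherit the step size of the winning inner ES.
            mean, best_f, sigma = cand, cand_f, cand_sigma
        else:
            sigma /= 2  # hypothetical fallback when no inner ES improves
    return mean, best_f, sigma


x0 = [3.0] * 10
mean, best_f, sigma = meta_es(sphere, x0, sigma0=1.0)
print(sphere(x0), best_f)
```

Because the outer update is elitist, the best objective value in this sketch can never get worse between isolation periods; the multi-recombination update, the cumulative step-size adaptation, and the CMA reconstruction mentioned in the abstract are deliberately left out.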
IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 11, pp. 2087-2101.
Citations: 0