
Latest Articles in IEEE Transactions on Parallel and Distributed Systems

HRCM: A Hierarchical Regularizing Mechanism for Sparse and Imbalanced Communication in Whole Human Brain Simulations
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-12 DOI: 10.1109/TPDS.2024.3387720
Xin Du;Minglong Wang;Zhihui Lu;Qiang Duan;Yuhao Liu;Jianfeng Feng;Huarui Wang
Brain simulation is one of the most important approaches to understanding how information is represented and processed in the brain, and it usually needs to be realized on supercomputers with a large number of interconnected graphics processing units (GPUs). For whole human brain simulation, tens of thousands of GPUs are utilized to simulate the tens of billions of neurons and tens of trillions of synapses of the living brain to reveal functional connectivity patterns. However, as an instance of the irregular sparse communication problem on a large-scale system, the sparse and imbalanced communication patterns of the human brain make it particularly challenging to design a communication system that supports large-scale brain simulations. To address this challenge, this paper proposes a hierarchical regularized communication mechanism, HRCM. HRCM maintains a hierarchical virtual communication topology (HVCT) with a merge-forward algorithm that exploits the sparsity of neuron interactions to regularize inter-process communications in brain simulations. HRCM also provides a neuron-level partition scheme for assigning neurons to simulation processes to balance the communication load while improving resource utilization. In HRCM, neuron partitioning is formulated as a k-way graph partitioning problem and solved efficiently by the proposed hybrid multi-constraint greedy (HMCG) algorithm. HRCM has been implemented in human brain simulations at the scale of up to 86 billion neurons running on 10,000 GPUs. Results obtained from extensive simulation experiments verify the effectiveness of HRCM in significantly reducing communication delay, increasing resource utilization, and shortening simulation time for large-scale human brain models.
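For illustration only, the sketch below shows one way a greedy, multi-constraint k-way partition of a neuron-interaction graph could look; it is not the paper's HMCG algorithm, and the graph encoding, the soft capacity cap, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict

def greedy_kway_partition(adj, k, capacity):
    """Greedy k-way partition: place each vertex where it has the most
    already-assigned neighbors, preferring partitions below a soft capacity cap.
    `adj` maps vertex -> {neighbor: edge_weight}; weights model synapse counts."""
    assignment = {}
    loads = [0] * k
    # Visit heavy (high total edge weight) vertices first so they anchor their neighborhoods.
    order = sorted(adj, key=lambda v: -sum(adj[v].values()))
    for v in order:
        gain = [0.0] * k
        for u, w in adj[v].items():
            if u in assignment:
                gain[assignment[u]] += w     # communication saved if co-located
        # Prefer partitions under the cap, then highest gain, then lowest load.
        best = min(range(k),
                   key=lambda p: (loads[p] >= capacity, -gain[p], loads[p]))
        assignment[v] = best
        loads[best] += 1
    return assignment, loads

# Toy example: 6 neurons with weighted interactions, 2 partitions of capacity 3.
adj = defaultdict(dict)
for a, b, w in [(0, 1, 5), (1, 2, 4), (2, 0, 3), (3, 4, 5), (4, 5, 4), (0, 3, 1)]:
    adj[a][b] = w
    adj[b][a] = w
print(greedy_kway_partition(adj, k=2, capacity=3))
```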
{"title":"HRCM: A Hierarchical Regularizing Mechanism for Sparse and Imbalanced Communication in Whole Human Brain Simulations","authors":"Xin Du;Minglong Wang;Zhihui Lu;Qiang Duan;Yuhao Liu;Jianfeng Feng;Huarui Wang","doi":"10.1109/TPDS.2024.3387720","DOIUrl":"10.1109/TPDS.2024.3387720","url":null,"abstract":"Brain simulation is one of the most important measures to understand how information is represented and processed in the brain, which usually needs to be realized in supercomputers with a large number of interconnected graphical processing units (GPUs). For the whole human brain simulation, tens of thousands of GPUs are utilized to simulate tens of billions of neurons and tens of trillions of synapses for the living brain to reveal functional connectivity patterns. However, as an application of the irregular spares communication problem on a large-scale system, the sparse and imbalanced communication patterns of the human brain make it particularly challenging to design a communication system for supporting large-scale brain simulations. To face this challenge, this paper proposes a hierarchical regularized communication mechanism, HRCM. The HRCM maintains a hierarchical virtual communication topology (HVCT) with a merge-forward algorithm that exploits the sparsity of neuron interactions to regularize inter-process communications in brain simulations. HRCM also provides a neuron-level partition scheme for assigning neurons to simulation processes to balance the communication load while improving resource utilization. In HRCM, neuron partition is formulated as a k-way graph partition problem and solved efficiently by the proposed hybrid multi-constraint greedy (HMCG) algorithm. HRCM has been implemented in human brain simulations at the scale of up to 86 billion neurons running on 10000 GPUs. Results obtained from extensive simulation experiments verify the effectiveness of HRCM in significantly reducing communication delay, increasing resource usage, and shortening simulation time for large-scale human brain models.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"901-918"},"PeriodicalIF":5.3,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140561605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FastTuning: Enabling Fast and Efficient Hyper-Parameter Tuning With Partitioning and Parallelism of Search Space
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-10 DOI: 10.1109/TPDS.2024.3386939
Xiaqing Li;Qi Guo;Guangyan Zhang;Siwei Ye;Guanhua He;Yiheng Yao;Rui Zhang;Yifan Hao;Zidong Du;Weimin Zheng
Hyper-parameter tuning (HPT) for deep learning (DL) models is prohibitively expensive. Sequential model-based optimization (SMBO) has emerged as the state-of-the-art (SOTA) approach to automatically optimize HPT performance due to its heuristic advantages. Unfortunately, because they focus on algorithm optimization rather than on a large-scale parallel HPT system, existing SMBO-based approaches still cannot effectively remove their strong sequential nature, posing two performance problems: (1) extremely low tuning speed and (2) sub-optimal model quality. In this paper, we propose FastTuning, a fast, scalable, and generic system that accelerates SMBO-based HPT for large DL/ML models in parallel. The key is to partition the highly complex search space into multiple smaller sub-spaces, each of which is assigned to and optimized by a different tuning worker in parallel. However, determining the right level of resource allocation to strike a balance between quality and cost remains a challenge. To address this, we further propose NIMBLE, a dynamic scheduling strategy that is specially designed for FastTuning, including (1) a Dynamic Elimination Algorithm, (2) Sub-space Re-division, and (3) Posterior Information Sharing. Finally, we incorporate 6 SOTAs (i.e., 3 tuning algorithms and 3 parallel tuning tools) into FastTuning. Experimental results, on ResNet18, VGG19, ResNet50, and ResNet152, show that FastTuning can consistently offer much faster tuning speed (up to $80\times$) with better accuracy (up to 4.7% improvement), thereby enabling the application of automatic HPT to real-life DL models.
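As a rough illustration of the partition-and-parallelize idea, the Python sketch below splits a hyper-parameter space into sub-spaces and gives each one to its own worker. The space definition, the log-uniform split, and the random-search stand-in for SMBO are all assumptions, not FastTuning's API.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Full search space: learning rate and batch size ranges (illustrative only).
SPACE = {"lr": (1e-5, 1e-1), "batch": (16, 256)}

def split_space(space, n):
    """Partition the space into n sub-spaces by slicing the 'lr' axis log-uniformly."""
    lo, hi = space["lr"]
    edges = [lo * (hi / lo) ** (i / n) for i in range(n + 1)]
    return [{"lr": (edges[i], edges[i + 1]), "batch": space["batch"]}
            for i in range(n)]

def objective(cfg):
    # Stand-in for a real training run; smaller is better.
    return (cfg["lr"] - 1e-3) ** 2 + (cfg["batch"] - 64) ** 2 / 1e6

def tuning_worker(sub_space, trials=50):
    """Random search stands in for one SMBO worker on its own sub-space."""
    best = None
    for _ in range(trials):
        cfg = {"lr": random.uniform(*sub_space["lr"]),
               "batch": random.randint(*sub_space["batch"])}
        score = objective(cfg)
        if best is None or score < best[0]:
            best = (score, cfg)
    return best

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tuning_worker, split_space(SPACE, 4)))
print(min(results, key=lambda r: r[0]))   # global best across all sub-spaces
```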
{"title":"FastTuning: Enabling Fast and Efficient Hyper-Parameter Tuning With Partitioning and Parallelism of Search Space","authors":"Xiaqing Li;Qi Guo;Guangyan Zhang;Siwei Ye;Guanhua He;Yiheng Yao;Rui Zhang;Yifan Hao;Zidong Du;Weimin Zheng","doi":"10.1109/TPDS.2024.3386939","DOIUrl":"10.1109/TPDS.2024.3386939","url":null,"abstract":"Hyper-parameter tuning (HPT) for deep learning (DL) models is prohibitively expensive. Sequential model-based optimization (SMBO) emerges as the state-of-the-art (SOTA) approach to automatically optimize HPT performance due to its heuristic advantages. Unfortunately, focusing on algorithm optimization rather than a large-scale parallel HPT system, existing SMBO-based approaches still cannot effectively remove their strong sequential nature, posing two performance problems: (1) \u0000<i>extremely low tuning speed</i>\u0000 and (2) \u0000<i>sub-optimal model quality</i>\u0000. In this paper, we propose FastTuning, a fast, scalable, and generic system aiming at parallelly accelerating SMBO-based HPT for large DL/ML models. The key is to partition the highly complex search space into multiple smaller sub-spaces, each of which is assigned to and optimized by a different tuning worker in parallel. However, determining the right level of resource allocation to strike a balance between quality and cost remains a challenge. To address this, we further propose NIMBLE, a dynamic scheduling strategy that is specially designed for FastTuning, including (1) Dynamic Elimination Algorithm, (2) Sub-space Re-division, and (3) Posterior Information Sharing. Finally, we incorporate 6 SOTAs (i.e., 3 tuning algorithms and 3 parallel tuning tools) into FastTuning. Experimental results, on ResNet18, VGG19, ResNet50, and ResNet152, show that FastTuning can consistently offer much faster tuning speed (up to \u0000<inline-formula><tex-math>$80times$</tex-math></inline-formula>\u0000) with better accuracy (up to 4.7% improvement), thereby enabling the application of automatic HPT to real-life DL models.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 7","pages":"1174-1188"},"PeriodicalIF":5.3,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140561683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-08 DOI: 10.1109/TPDS.2024.3385639
Zheng Zhang;Yaqi Xia;Hulin Wang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Dazhao Cheng
In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system and algorithm optimizations, there are significant challenges to be tackled when it comes to the inefficiencies of communication and memory consumption. In this paper, we present the design and implementation of MPMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by the observation that the MoE training procedure can be divided into multiple independent sub-stages, we design a pipeline parallelism method that reduces communication latency by overlapping communication with computation operations. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies to reduce memory requirements by eliminating memory redundancies. Finally, to optimize pipeline granularity and memory reuse strategies jointly, we propose a profile-based algorithm and a performance model to determine the configurations of MPMoE at runtime. We implement MPMoE upon PyTorch and evaluate it with common MoE models in two physical clusters, including 64 NVIDIA A100 GPU cards and 16 NVIDIA V100 GPU cards. Compared with the state-of-the-art approach, MPMoE achieves up to 2.3× speedup while reducing more than 30% memory footprint for training large models.
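A minimal sketch of the overlap idea, not MPMoE's implementation: communication for the next micro-batch chunk is issued asynchronously while the current chunk's expert computation runs. The chunking, sleep timings, and function names are illustrative placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def dispatch(chunk):
    """Stand-in for the all-to-all that routes tokens to their experts."""
    time.sleep(0.05)          # pretend network latency
    return chunk

def expert_compute(chunk):
    """Stand-in for running the selected experts on the received tokens."""
    time.sleep(0.05)          # pretend GPU work
    return [x * 2 for x in chunk]

def pipelined_moe_step(chunks):
    """Overlap communication of chunk i+1 with computation on chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        next_fut = comm.submit(dispatch, chunks[0])
        for i in range(len(chunks)):
            received = next_fut.result()
            if i + 1 < len(chunks):
                next_fut = comm.submit(dispatch, chunks[i + 1])  # comm in flight...
            results.append(expert_compute(received))             # ...while we compute
    return results

chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
start = time.time()
out = pipelined_moe_step(chunks)
print(out, f"{time.time() - start:.2f}s")   # ~0.25 s here vs ~0.40 s fully serial
```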
{"title":"MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism","authors":"Zheng Zhang;Yaqi Xia;Hulin Wang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TPDS.2024.3385639","DOIUrl":"10.1109/TPDS.2024.3385639","url":null,"abstract":"In recent years, the Mixture-of-Experts (MoE) technique has gained widespread popularity as a means to scale pre-trained models to exceptionally large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system and algorithm optimizations, there are significant challenges to be tackled when it comes to the inefficiencies of communication and memory consumption. In this paper, we present the design and implementation of MPMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by that the MoE training procedure can be divided into multiple independent sub-stages. We design a pipeline parallelism method for reducing communication latency by overlapping with computation operations. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies to reduce memory requirements by eliminating memory redundancies. Finally, to optimize pipeline granularity and memory reuse strategies jointly, we propose a profile-based algorithm and a performance model to determine the configurations of MPMoE at runtime. We implement MPMoE upon PyTorch and evaluate it with common MoE models in two physical clusters, including 64 NVIDIA A100 GPU cards and 16 NVIDIA V100 GPU cards. Compared with the state-of-art approach, MPMoE achieves up to 2.3× speedup while reducing more than 30% memory footprint for training large models.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"843-856"},"PeriodicalIF":5.3,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140561554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
G-Learned Index: Enabling Efficient Learned Index on GPU
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-04-02 DOI: 10.1109/TPDS.2024.3381214
Jiesong Liu;Feng Zhang;Lv Lu;Chang Qi;Xiaoguang Guo;Dong Deng;Guoliang Li;Huanchen Zhang;Jidong Zhai;Hechen Zhang;Yuxing Chen;Anqun Pan;Xiaoyong Du
AI and GPU technologies have been widely applied to solve Big Data problems. The total data volume worldwide reached 200 zettabytes in 2022, and how to efficiently index the required content among massive data has become a serious problem. Recently, a promising learned index has been proposed to address this challenge: it has extremely high efficiency while retaining marginal space overhead. However, we notice that previous learned indexes have mainly focused on CPU architecture, while ignoring the advantages of GPU. Because traditional indexes like B-Tree, LSM, and bitmap have greatly benefited from GPU acceleration, the combination of a learned index and a GPU has great potential to achieve tremendous speedups. In this paper, we propose a GPU-based learned index, called G-Learned Index, to significantly improve the performance of learned index structures. The primary challenges in developing G-Learned Index lie in the use of thousands of GPU cores, including minimizing synchronization and branch divergence; data structure design for parallel operations; and the usage of memory bandwidth, including limited memory transactions and the multi-level memory hierarchy. To overcome these challenges, a series of novel technologies are developed, including efficient thread organization, succinct data structures, and heterogeneous memory hierarchy utilization. Compared to the state-of-the-art learned index, the proposed G-Learned Index achieves an average of 174× speedup (and 107× over its parallel version). Meanwhile, we attain 2× lower query time than the state-of-the-art GPU B-Tree. Our further exploration of range queries shows that G-Learned Index is $17\times$ faster than a CPU multi-dimensional learned index.
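A toy single-segment learned index, assuming sorted numeric keys, a least-squares linear model, and an error-bounded correction search; the real G-Learned Index relies on GPU-specific structures not shown here.

```python
import bisect

class TinyLearnedIndex:
    """A one-segment learned index: a linear model predicts the position of a
    key in a sorted array, and a bounded binary search fixes the prediction."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by simple least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
        self.slope = sum((k - mean_k) * (i - mean_p)
                         for i, k in enumerate(self.keys)) / var
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst prediction error to bound the search window.
        self.err = max(abs(i - self._predict(k))
                       for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        n = len(self.keys)
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(n, p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None

idx = TinyLearnedIndex(range(0, 1000, 3))
print(idx.lookup(999), idx.lookup(500))   # position of 999, None for a missing key
```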
{"title":"G-Learned Index: Enabling Efficient Learned Index on GPU","authors":"Jiesong Liu;Feng Zhang;Lv Lu;Chang Qi;Xiaoguang Guo;Dong Deng;Guoliang Li;Huanchen Zhang;Jidong Zhai;Hechen Zhang;Yuxing Chen;Anqun Pan;Xiaoyong Du","doi":"10.1109/TPDS.2024.3381214","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3381214","url":null,"abstract":"AI and GPU technologies have been widely applied to solve Big Data problems. The total data volume worldwide reaches 200 zettabytes in 2022. How to efficiently index the required content among massive data becomes serious. Recently, a promising learned index has been proposed to address this challenge: It has extremely high efficiency while retaining marginal space overhead. However, we notice that previous learned indexes have mainly focused on CPU architecture, while ignoring the advantages of GPU. Because traditional indexes like B-Tree, LSM, and bitmap have greatly benefited from GPU acceleration, a combination of a learned index and GPU has great potentials to reach tremendous speedups. In this paper, we propose a GPU-based learned index, called G-Learned Index, to significantly improve the performance of learned index structures. The primary challenges in developing G-Learned Index lie in the use of thousands of GPU cores including minimization of synchronization and branch divergence, data structure design for parallel operations, and usage of memory bandwidth including limited memory transactions and multi-memory hierarchy. To overcome these challenges, a series of novel technologies are developed, including efficient thread organization, succinct data structures, and heterogeneous memory hierarchy utilization. Compared to the state-of-the-art learned index, the proposed G-Learned Index achieves an average of 174× speedup (and 107× of its parallel version). Meanwhile, we attain 2× less query time over the state-of-the-art GPU B-Tree. Our further exploration of range queries shows that G-Learned Index is \u0000<inline-formula><tex-math>$17times$</tex-math></inline-formula>\u0000 faster than CPU multi-dimensional learned index.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"795-812"},"PeriodicalIF":5.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140546591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Static Allocation is Not a Static: Optimizing SSD Address Allocation Through Boosting Static Policy
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-30 DOI: 10.1109/TPDS.2024.3407367
Yang Zhou;Fang Wang;Zhan Shi;Dan Feng
The address allocation policy in an SSD translates the logical addresses of I/O requests into physical addresses, and static address allocation is widely used in modern SSDs. Through extensive experiments, we find that there are significant differences in the utilization of SSD parallelism among different static address allocation policies. We also observe that the fixed address allocation design prevents SSDs from continuing to meet the challenges posed by cloud workloads and misses the possibility of further optimization. These situations stem from our excessive reliance on SSD parallelism over time. In this paper, we propose HsaP, a hybrid static address allocation policy that adaptively chooses the best static allocation policy to meet SSD performance requirements at runtime. HsaP is a dynamic scheduling scheme based on static address allocation policies. The static policy ensures that HsaP has stable performance and light-weight overhead, while dynamic scheduling can effectively combine different allocation policies, selecting the best-performing static mapping mode for a given SSD state. Meanwhile, HsaP can further improve the read and write performance of SSDs simultaneously through plane reallocation and data rewrite. Experimental results show that HsaP achieves significant read and write performance gains on a wide range of the latest cloud block storage traces compared to several state-of-the-art address allocation approaches.
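To make the notion of a static allocation policy concrete, the sketch below maps a logical page number onto channel/chip/die/plane indices by striping in a chosen priority order. The dimension sizes and the two example orders are assumptions, and HsaP's actual policy selection is not modeled.

```python
def static_allocate(lpn, order, dims):
    """Map a logical page number to physical-parallelism indices by striping
    across the dimensions in the given priority order (earliest varies fastest).
    `dims` gives the size of each dimension, e.g. {"channel": 8, "chip": 4, ...}."""
    addr = {}
    for dim in order:
        addr[dim] = lpn % dims[dim]
        lpn //= dims[dim]
    return addr

dims = {"channel": 8, "chip": 4, "die": 2, "plane": 2}
# Two of the 24 possible static policies: channel-first vs. plane-first striping.
for order in (["channel", "chip", "die", "plane"],
              ["plane", "die", "chip", "channel"]):
    channels = [static_allocate(lpn, order, dims)["channel"] for lpn in range(8)]
    print(order[0] + "-first channel usage for LPNs 0-7:", channels)
```

Running it shows that channel-first striping spreads eight consecutive pages over eight channels, while plane-first striping leaves them all on channel 0, which is exactly the kind of parallelism gap the abstract describes.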
{"title":"The Static Allocation is Not a Static: Optimizing SSD Address Allocation Through Boosting Static Policy","authors":"Yang Zhou;Fang Wang;Zhan Shi;Dan Feng","doi":"10.1109/TPDS.2024.3407367","DOIUrl":"10.1109/TPDS.2024.3407367","url":null,"abstract":"The address allocation policy in SSD aims to translate the logical address of I/O requests into a physical address, and the static address allocation is widely used in modern SSD. Through extensive experiments, we find that there are significant differences in the utilization of SSD parallelism among different static address allocation policies. We also observe that the fixed address allocation design prevents SSDs from continuing to meet the challenges posed by cloud workloads and misses the possibility of further optimization. These situations stem from our excessive reliance on SSD parallelism over time. In this paper, we propose \u0000<monospace>HsaP</monospace>\u0000, a \u0000<underline>h</u>\u0000ybrid \u0000<underline>s</u>\u0000tatic address \u0000<underline>a</u>\u0000llocation \u0000<underline>p</u>\u0000olicy, that adaptively chooses the best static allocation policy to meet the SSD performance at runtime. \u0000<monospace>HsaP</monospace>\u0000 is a \u0000<italic>dynamic</i>\u0000 scheduling scheme based on \u0000<italic>static</i>\u0000 address allocation policy. The static policy ensures that \u0000<monospace>HsaP</monospace>\u0000 has stable performance and light-weight overhead, while dynamic scheduling can effectively combine different allocation policies, selecting the best-performing static mapping mode for a given SSD state. Meanwhile, \u0000<monospace>HsaP</monospace>\u0000 can further improve the read and write performance of SSDs simultaneously through plane reallocation and data rewrite. Experimental results show that \u0000<monospace>HsaP</monospace>\u0000 achieves significant read and write performance gain of a wide range of the latest cloud block storage traces compared to several state-of-the-art address allocation approaches.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1373-1386"},"PeriodicalIF":5.3,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities
IF 5.6 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-29 DOI: 10.1109/TPDS.2024.3406764
Ahmad Tarraf;Martin Schreiber;Alberto Cascajo;Jean-Baptiste Besnard;Marc-André Vef;Dominik Huber;Sonja Happ;André Brinkmann;David E. Singh;Hans-Christian Hoppe;Alberto Miranda;Antonio J. Peña;Rui Machado;Marta Garcia-Gasulla;Martin Schulz;Paul Carpenter;Simon Pickartz;Tiberiu Rotaru;Sergio Iserte;Victor Lopez;Jorge Ejarque;Heena Sirwani;Jesus Carretero;Felix Wolf
With the increase in complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels by adapting the applications' configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for more than two decades. This article presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.
{"title":"Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities","authors":"Ahmad Tarraf;Martin Schreiber;Alberto Cascajo;Jean-Baptiste Besnard;Marc-André Vef;Dominik Huber;Sonja Happ;André Brinkmann;David E. Singh;Hans-Christian Hoppe;Alberto Miranda;Antonio J. Peña;Rui Machado;Marta Garcia-Gasulla;Martin Schulz;Paul Carpenter;Simon Pickartz;Tiberiu Rotaru;Sergio Iserte;Victor Lopez;Jorge Ejarque;Heena Sirwani;Jesus Carretero;Felix Wolf","doi":"10.1109/TPDS.2024.3406764","DOIUrl":"10.1109/TPDS.2024.3406764","url":null,"abstract":"With the increase of complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels, by adapting the applications’ configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for more than two decades. This article presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1551-1564"},"PeriodicalIF":5.6,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10541114","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Proactive Caching With Distributed Deep Reinforcement Learning in 6G Cloud-Edge Collaboration Computing
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-28 DOI: 10.1109/TPDS.2024.3406027
Changmao Wu;Zhengwei Xu;Xiaoming He;Qi Lou;Yuanyuan Xia;Shuman Huang
Proactive caching in 6G cloud-edge collaboration scenarios, which intelligently and periodically updates the cached contents, can alleviate traffic congestion on the backhaul and edge cooperative links and bring multimedia services to mobile users. To further improve the network performance of the 6G cloud-edge, we consider the issue of multi-objective joint optimization, i.e., maximizing the edge hit ratio while minimizing content access latency and traffic cost. To solve this complex problem, we focus on a distributed deep reinforcement learning (DRL)-based method for proactive caching, including content prediction and content decision-making. Specifically, since prior information about user requests is seldom available in practice for the current time period, a novel method named the temporal convolution sequence network (TCSN), based on the temporal convolution network (TCN) and an attention model, is used to improve the accuracy of content prediction. Furthermore, according to the value of the content prediction, the distributional deep Q network (DDQN) builds a distribution model over returns to optimize the content decision-making policy. The generative adversarial network (GAN) is adapted in a distributed fashion, emphasizing learning the data distribution and generating compelling data across multiple nodes. In addition, prioritized experience replay (PER) helps learn from the most effective samples. We therefore propose a multivariate fusion algorithm called PG-DDQN. Finally, faced with such a complex scenario, a distributed learning architecture, i.e., a multi-agent learning architecture, is used to learn the DRL-based methods efficiently in a centralized-training, distributed-inference manner. Experiments show that our proposal achieves satisfactory performance in terms of edge hit ratio, traffic cost, and content access latency.
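A minimal prioritized experience replay (PER) buffer, shown only to illustrate the sampling idea referenced above; the transition format, hyper-parameters, and class name are assumptions, not the paper's PG-DDQN implementation.

```python
import random

class PrioritizedReplay:
    """Tiny prioritized experience replay buffer: transitions with larger
    TD error are sampled more often (proportional prioritization)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def push(self, transition, td_error):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)          # drop the oldest transition
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        total = sum(self.priorities)
        weights = [p / total for p in self.priorities]
        return random.choices(self.buffer, weights=weights, k=batch_size)

# Cache-decision transitions: (state, action = content id to cache, reward, next state).
buf = PrioritizedReplay(capacity=1000)
buf.push(("s0", 42, 1.0, "s1"), td_error=0.9)   # surprising -> sampled often
buf.push(("s1", 7, 0.0, "s2"), td_error=0.1)    # expected  -> sampled rarely
print(buf.sample(4))
```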
{"title":"Proactive Caching With Distributed Deep Reinforcement Learning in 6G Cloud-Edge Collaboration Computing","authors":"Changmao Wu;Zhengwei Xu;Xiaoming He;Qi Lou;Yuanyuan Xia;Shuman Huang","doi":"10.1109/TPDS.2024.3406027","DOIUrl":"10.1109/TPDS.2024.3406027","url":null,"abstract":"Proactive caching in 6G cloud-edge collaboration scenarios, intelligently and periodically updating the cached contents, can either alleviate the traffic congestion of backhaul link and edge cooperative link or bring multimedia services to mobile users. To further improve the network performance of 6G cloud-edge, we consider the issue of multi-objective joint optimization, i.e., maximizing edge hit ratio while minimizing content access latency and traffic cost. To solve this complex problem, we focus on the distributed deep reinforcement learning (DRL)-based method for proactive caching, including content prediction and content decision-making. Specifically, since the prior information of user requests is seldom available practically in the current time period, a novel method named temporal convolution sequence network (TCSN) based on the temporal convolution network (TCN) and attention model is used to improve the accuracy of content prediction. Furthermore, according to the value of content prediction, the distributional deep Q network (DDQN) seeks to build a distribution model on returns to optimize the policy of content decision-making. The generative adversarial network (GAN) is adapted in a distributed fashion, emphasizing learning the data distribution and generating compelling data across multiple nodes. In addition, the prioritized experience replay (PER) is helpful to learn from the most \u0000<italic>effective</i>\u0000 sample. So we propose a multivariate fusion algorithm called PG-DDQN. Finally, faced with such a complex scenario, a distributed learning architecture, i.e., multi-agent learning architecture is efficiently used to learn DRL-based methods in a manner of centralized training and distributed inference. The experiments prove that our proposal achieves satisfactory performance in terms of edge hit ratio, traffic cost and content access latency.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1387-1399"},"PeriodicalIF":5.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141192857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-28 DOI: 10.1109/TPDS.2024.3406420
Shengwei Li;Kai Lu;Zhiquan Lai;Weijie Liu;Keshi Ge;Dongsheng Li
Transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. As the number of parameters increases, hybrid parallel training becomes imperative for scaling training. The primary bottleneck in scaling remains the communication overhead. The communication scheduling technique, emphasizing the overlap of communication with computation, has demonstrated its benefits in scaling. However, most existing works focus on data parallelism, overlooking the nuances of hybrid parallel training. In this paper, we propose TriRace, an efficient communication scheduling framework for accelerating communications in hybrid parallel training of asynchronous pipeline parallelism and data parallelism. To achieve effective computation-communication overlap, TriRace introduces 3D communication scheduling, which adeptly leverages data dependencies between communication and computations, efficiently scheduling AllReduce communication, sparse communication, and peer-to-peer communication in hybrid parallel training. To avoid possible communication contentions, TriRace also incorporates a topology-aware runtime which optimizes the execution of communication operations by considering ongoing communication operations and real-time network status. We have implemented a prototype of TriRace based on PyTorch and Pipedream-2BW, and conducted comprehensive evaluations with three representative baselines. Experimental results show that TriRace achieves up to 1.07–1.45× speedup compared to the state-of-the-art pipeline parallelism training baseline Pipedream-2BW, and 1.24–1.81× speedup compared to Megatron.
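A simplified, dependency-aware ordering of pending communication operations, assuming each operation is tagged with the compute step that first consumes its result; this is only a stand-in for TriRace's 3D scheduling, and the operation tuples below are made up.

```python
import heapq

def schedule_comm(ops):
    """Earliest-consumer-first scheduling of pending communication operations.
    Each op is (needed_at_step, kind, tensor_name); ops whose results are
    consumed sooner by the compute stream are issued first."""
    heap = [(needed, i, kind, name) for i, (needed, kind, name) in enumerate(ops)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, kind, name = heapq.heappop(heap)
        order.append((kind, name))
    return order

# Pending ops in one hybrid-parallel step: P2P activations feed the very next
# pipeline stage, sparse embedding grads are needed mid-step, and the dense
# AllReduce is only consumed at the optimizer step.
pending = [(10, "allreduce", "dense_grads"),
           (1,  "p2p",       "stage3_activations"),
           (4,  "sparse",    "embedding_grads")]
print(schedule_comm(pending))
# -> [('p2p', 'stage3_activations'), ('sparse', 'embedding_grads'), ('allreduce', 'dense_grads')]
```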
{"title":"A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training","authors":"Shengwei Li;Kai Lu;Zhiquan Lai;Weijie Liu;Keshi Ge;Dongsheng Li","doi":"10.1109/TPDS.2024.3406420","DOIUrl":"10.1109/TPDS.2024.3406420","url":null,"abstract":"The transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. With the increasing parameter number, hybrid parallel training becomes imperative to scale training. The primary bottleneck in scaling remains the communication overhead. The communication scheduling technique, emphasizing the overlap of communication with computation, has demonstrated its benefits in scaling. However, most existing works focus on data parallelism, overlooking the nuances of hybrid parallel training. In this paper, we propose \u0000<monospace>TriRace</monospace>\u0000, an efficient communication scheduling framework for accelerating communications in hybrid parallel training of asynchronous pipeline parallelism and data parallelism. To achieve effective computation-communication overlap, \u0000<monospace>TriRace</monospace>\u0000 introduces \u0000<italic>3D communication scheduling</i>\u0000, which adeptly leverages data dependencies between communication and computations, efficiently scheduling AllReduce communication, sparse communication, and peer-to-peer communication in hybrid parallel training. To avoid possible communication contentions, \u0000<monospace>TriRace</monospace>\u0000 also incorporates a \u0000<italic>topology-aware runtime</i>\u0000 which optimizes the execution of communication operations by considering ongoing communication operations and real-time network status. We have implemented a prototype of \u0000<monospace>TriRace</monospace>\u0000 based on PyTorch and Pipedream-2BW, and conducted comprehensive evaluations with three representative baselines. Experimental results show that \u0000<monospace>TriRace</monospace>\u0000 achieves up to 1.07–1.45× speedup compared to the state-of-the-art pipeline parallelism training baseline Pipedream-2BW, and 1.24–1.81× speedup compared to the Megatron.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1415-1428"},"PeriodicalIF":5.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141198394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parallel Computation of Dominance Scores for Multidimensional Datasets on GPUs
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-27 DOI: 10.1109/TPDS.2024.3382119
Wei-Mei Chen;Hsin-Hung Tsai;Joon Fong Ling
The dominance scoring problem in a multidimensional dataset is to return the number of points dominated by a given point, which is a common metric for evaluating the quality of a data point. Dominance scoring is an elementary operator for variations of the skyline operator, including top-$k$ dominating and $k$-skyband queries. This study proposes query processing for dominance scores that operates primarily on the graphics processing unit (GPU) to fully utilize its massive processing resources and restricted memory space while reducing the transfer overhead between the central processing unit (CPU) and GPU. We introduce a heap-based multidimensional data structure with complete and well-balanced characteristics. Using our preprocessed data, we can construct a complete R-tree with the non-overlapping property, ensuring that the bounding boxes of internal nodes of the same level do not overlap, thereby reducing redundant operations. In addition, we propose two algorithms based on depth-first and breadth-first traversals to accumulate the dominance score on GPUs in parallel. Both take full advantage of the GPU's computing resources and memory space supported by the non-overlapping tree structures. Experiments on synthetic and real-world datasets demonstrate that the proposed algorithms implemented on GPUs dramatically improve the efficiency of dominance scoring.
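A CPU/NumPy baseline for the dominance score itself (not the paper's GPU algorithms), assuming a smaller-is-better dominance relation on numeric attributes.

```python
import numpy as np

def dominance_score(point, data):
    """Number of rows in `data` dominated by `point`: the point is no worse in
    every dimension and strictly better in at least one (smaller is better)."""
    no_worse = (data >= point).all(axis=1)
    strictly = (data > point).any(axis=1)
    return int(np.count_nonzero(no_worse & strictly))

rng = np.random.default_rng(0)
data = rng.random((100_000, 3))          # 100k points, 3 attributes
query = np.array([0.2, 0.3, 0.1])
print(dominance_score(query, data))
```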
{"title":"Parallel Computation of Dominance Scores for Multidimensional Datasets on GPUs","authors":"Wei-Mei Chen;Hsin-Hung Tsai;Joon Fong Ling","doi":"10.1109/TPDS.2024.3382119","DOIUrl":"10.1109/TPDS.2024.3382119","url":null,"abstract":"The dominance scoring problem in a multidimensional dataset is to return the number of points dominated by a given point, which is a common metric for evaluating the quality of a data point. Dominance scoring is an elementary operator for variations of the skyline operator, including top-\u0000<inline-formula><tex-math>$k$</tex-math></inline-formula>\u0000 dominating and \u0000<inline-formula><tex-math>$k$</tex-math></inline-formula>\u0000-skyband queries. This study proposes query processing for dominance scores that operates primarily on the graphics processing unit (GPU) to fully utilize its massive processing resources and restricted memory space while reducing the transfer overhead between the central processing unit (CPU) and GPU. We introduce a heap-based multidimensional data structure with complete and well-balanced characteristics. Using our preprocessed data, we can construct a complete R-tree with the non-overlapping property, ensuring that the bounding boxes of internal nodes of the same level do not overlap, thereby reducing redundant operations. In addition, we propose two algorithms based on depth-first and breadth-first traversals to accumulate the dominance score on GPUs in parallel. Both take full advantage of the GPU's computing resources and memory space supported by the non-overlapping tree structures. Experiments on synthetic and real-world datasets demonstrate that the proposed algorithms implemented on GPUs dramatically improve the efficiency of dominance scoring.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 6","pages":"764-776"},"PeriodicalIF":5.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140316344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Age-of-Event Aware: Sampling Period Optimization in a Three-Stage Wireless Cyber-Physical System With Diverse Parallelisms
IF 5.3 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-03-27 DOI: 10.1109/TPDS.2024.3405790
Yanxi Zhang;Muyu Mei;Dongqi Yan;Xu Zhang;Qinghai Yang;Mingwu Yao
With the emergence of parallel computing systems and distributed time-sensitive applications, it is urgent to provide statistical guarantees for age of information (AoI) in wireless cyber-physical systems (WCPS) with diverse parallelisms. However, most of the existing research on AoI has tended to focus on serial transmission, and the AoI performance of multi-stage parallel systems remains unclear. To help address these research gaps, in this work, we set out to investigate the age of event (AoE) violation probability in a three-stage WCPS with diverse parallelisms such as fork-join and split-merge. We analyze both transient and steady-state characteristics of the AoE violation probability (AoEVP). Using these characteristics, we transform the AoEVP minimization problem into an equivalent minimization problem. Moreover, we develop a queuing model to capture the queue dynamics under the max-plus theory of the stochastic network calculus (SNC) approach. Based on the max-plus model, we derive a closed-form Chernoff upper bound for the equivalent problem by applying the union bound and the Chernoff inequality. Furthermore, we characterize the service process for the different parallelisms applicable to each stage. By solving the Chernoff upper bound with the moment generating functions (MGFs) of the service processes, we obtain heuristic update-period solutions for minimizing the AoEVP of the three-stage WCPS. Simulation results validate our analysis and demonstrate that our heuristic update-period solutions are near-optimal for minimizing the AoEVP of the three-stage WCPS with diverse parallelisms.
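For reference, such a derivation typically chains a union bound with a Chernoff bound expressed through MGFs; the generic form is sketched below, with $A_i$ and $d_i$ standing in for per-stage accumulated delays and thresholds and $M_{A_i}(\theta)$ for the corresponding MGF (illustrative notation, not the paper's exact bound).

```latex
% Generic shape of the bound: a union bound over the three stages, then a
% Chernoff bound on each per-stage delay A_i with threshold d_i.
\begin{align}
  \Pr\Big(\bigcup_{i=1}^{3}\{A_i > d_i\}\Big)
      &\le \sum_{i=1}^{3}\Pr(A_i > d_i), \\
  \Pr(A_i > d_i)
      &\le \min_{\theta>0} e^{-\theta d_i}\,\mathbb{E}\!\left[e^{\theta A_i}\right]
       = \min_{\theta>0} e^{-\theta d_i}\, M_{A_i}(\theta).
\end{align}
```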
{"title":"Age-of-Event Aware: Sampling Period Optimization in a Three-Stage Wireless Cyber-Physical System With Diverse Parallelisms","authors":"Yanxi Zhang;Muyu Mei;Dongqi Yan;Xu Zhang;Qinghai Yang;Mingwu Yao","doi":"10.1109/TPDS.2024.3405790","DOIUrl":"10.1109/TPDS.2024.3405790","url":null,"abstract":"With the emergence of parallel computing systems and distributed time-sensitive applications, it is urgent to provide statistical guarantees for age of information (AoI) in wireless cyber-physical systems (WCPS) with diverse parallelisms. However, most of the existing research on AoI have tended to focus on serial transmission, and the AoI performance of multi-stage parallel systems remains unclear. To help address these research gaps, in this work, we set out to investigate the age of event (AoE) violation probability in a three-stage WCPS with diverse parallelisms such as fork-join and split-merge. We analyze both transient and steady-state characteristics of AoE violation probability (AoEVP). Using these characteristics, we transform the AoEVP minimization problem into an equivalent minimization problem. Moreover, we develop a queuing model to capture the queue dynamics under the max-plus theory of stochastic network calculus (SNC) approach. Based on the max-plus model, we derive a closed-form Chernoff upper bound for the equivalent problem by applying the union bound and the Chernoff inequality. Furthermore, we characterize the service process for different parallelisms applicable to each stage. By solving the Chernoff upper bound with the service moment generation functions (MGFs), we obtain heuristic update period solutions for minimizing the AoEVP of three-stage WCPS. Simulation results validate our analysis and demonstrate that our heuristic update period solutions are near optimal for minimizing the AoEVP of three-stage WCPS with diverse parallelisms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 8","pages":"1360-1372"},"PeriodicalIF":5.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141168933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0