GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00081
Milan Shah, Xiaodong Yu, S. Di, Danylo Lykov, Y. Alexeev, M. Becchi, F. Cappello
Quantum circuit simulations enable researchers to develop quantum algorithms without the need for a physical quantum computer. Quantum computing simulators, however, all suffer from significant memory footprint requirements, which prevents large circuits from being simulated on classical supercomputers. In this paper, we explore different lossy compression strategies to substantially shrink quantum circuit tensors in the QTensor package (a state-of-the-art tensor network quantum circuit simulator) while ensuring the reconstructed data satisfy the user-specified fidelity. Our contribution is fourfold. (1) We propose a series of optimized pre- and post-processing steps that boost the compression ratio of tensors with very limited performance overhead. (2) We characterize the impact of lossy decompressed data on quantum circuit simulation results, and leverage this analysis to ensure the fidelity of the reconstructed data. (3) We propose a configurable GPU compression framework based on cuSZ and cuSZx, two state-of-the-art GPU-accelerated lossy compressors, to address different use cases: prioritizing either compression ratio or compression speed. (4) We perform a comprehensive evaluation by running 9 state-of-the-art compressors on an NVIDIA A100 GPU over QTensor-generated tensors of varying sizes. When prioritizing compression ratio, our results show that our strategies can increase the compression ratio nearly 10× compared to using only cuSZ. When prioritizing throughput, we can compress at a speed comparable to cuSZx while achieving 3-4× higher compression ratios. Decompressed tensors can be used in QTensor circuit simulation to yield a final energy result within 1-5% of the true energy value.
{"title":"GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations","authors":"Milan Shah, Xiaodong Yu, S. Di, Danylo Lykov, Y. Alexeev, M. Becchi, F. Cappello","doi":"10.1109/IPDPS54959.2023.00081","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00081","url":null,"abstract":"Quantum circuit simulations enable researchers to develop quantum algorithms without the need for a physical quantum computer. Quantum computing simulators, however, all suffer from significant memory footprint requirements, which prevents large circuits from being simulated on classical super-computers. In this paper, we explore different lossy compression strategies to substantially shrink quantum circuit tensors in the QTensor package (a state-of-the-art tensor network quantum circuit simulator) while ensuring the reconstructed data satisfy the user-needed fidelity.Our contribution is fourfold. (1) We propose a series of optimized pre- and post-processing steps to boost the compression ratio of tensors with a very limited performance overhead. (2) We characterize the impact of lossy decompressed data on quantum circuit simulation results, and leverage the analysis to ensure the fidelity of reconstructed data. (3) We propose a configurable compression framework for GPU based on cuSZ and cuSZx, two state-of-the-art GPU-accelerated lossy compressors, to address different use-cases: either prioritizing compression ratios or prioritizing compression speed. (4) We perform a comprehensive evaluation by running 9 state-of-the-art compressors on an NVIDIA A100 GPU based on QTensor-generated tensors of varying sizes. When prioritizing compression ratio, our results show that our strategies can increase the compression ratio nearly 10 times compared to using only cuSZ. When prioritizing throughput, we can perform compression at the comparable speed as cuSZx while achieving 3-4× higher compression ratios. Decompressed tensors can be used in QTensor circuit simulation to yield a final energy result within 1-5% of the true energy value.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"16 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113990403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Arithmetic Intensity of Distributed-Memory Dense Matrix Multiplication Involving a Symmetric Input Matrix (SYMM)
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00044
E. Agullo, A. Buttari, O. Coulaud, Lionel Eyraud-Dubois, Mathieu Faverge, Alain Franc, A. Guermouche, Antoine Jego, Romain Peressoni, Florent Pruvost
Dense matrix multiplication involving a symmetric input matrix (SYMM) is implemented in reference distributed-memory codes with the same data distribution as its general analogue (GEMM). We show that, when the symmetric matrix is dominant, such a 2D block-cyclic (2D BC) scheme leads to an arithmetic intensity (AI) for SYMM that is a factor of 2 lower than that of GEMM. We propose alternative data distributions that preserve SYMM's memory benefit of storing only half of the matrix while achieving up to the same AI as GEMM. We also show that, when we can afford the same memory footprint as GEMM, SYMM can achieve a higher AI. We propose a task-based design of SYMM that is independent of the data distribution. This design allows for a scalable A-stationary SYMM with which all the discussed data distributions, even very irregular ones, can be easily assessed. We have integrated the resulting code in a dimension reduction algorithm involving a randomized singular value decomposition dominated by SYMM. An experimental study shows a compelling impact on performance.
{"title":"On the Arithmetic Intensity of Distributed-Memory Dense Matrix Multiplication Involving a Symmetric Input Matrix (SYMM)","authors":"E. Agullo, A. Buttari, O. Coulaud, Lionel Eyraud-Dubois, Mathieu Faverge, Alain Franc, A. Guermouche, Antoine Jego, Romain Peressoni, Florent Pruvost","doi":"10.1109/IPDPS54959.2023.00044","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00044","url":null,"abstract":"Dense matrix multiplication involving a symmetric input matrix (SYMM) is implemented in reference distributed-memory codes with the same data distribution as its general analogue (GEMM). We show that, when the symmetric matrix is dominant, such a 2D block-cyclic (2D BC) scheme leads to a lower arithmetic intensity (AI) of SYMM than that of GEMM by a factor of 2. We propose alternative data distributions preserving the memory benefit of SYMM of storing only half of the matrix while achieving up to the same AI as GEMM. We also show that, in the case we can afford the same memory footprint as GEMM, SYMM can achieve a higher AI. We propose a task-based design of SYMM independent of the data distribution. This design allows for scalable A-stationary SYMM with which all discussed data distributions, may they be very irregular, can be easily assessed. We have integrated the resulting code in a reduction dimension algorithm involving a randomized singular value decomposition dominated by SYMM. An experimental study shows a compelling impact on performance.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123635355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient 2D Method for Training Super-Large Deep Learning Models
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00031
Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You
Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and have shown unprecedented performance in tasks like translation, classification, and text generation. However, due to memory constraints, model parallelism must be used to split the model across multiple processors. Inter-layer partitioning, intra-layer partitioning, and sparse activation are the major approaches to model parallelism. Among them, inter-layer partitioning [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals the number of processors, and introduces either gradient staleness or bubble overhead, while sparse activation [12] is primarily designed for Google TPU clusters and is hard to deploy on GPU servers. Intra-layer partitioning [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though pioneers of intra-layer parallelism, these approaches still exhibit memory redundancy and sub-optimal communication efficiency, which leaves room for further improvement. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed across processors all the way through the forward and backward propagations, with hardly any memory redundancy. The isoefficiency function of communication in pure model parallelism improves from $W \sim p^3$ for Megatron-LM to $W \sim (\sqrt{p}\log p)^3$ for our Optimus. The framework is implemented with the open-source deep learning framework PyTorch, and consolidates existing techniques such as mixed-precision training [13], activation checkpointing [5], and data parallelism. In experiments on the TACC Frontera supercomputer, Optimus shows 1.48× the training speed, 1.78× the inference speed, and 8× the maximum batch size of Megatron-LM on 64 GPUs in pure model parallelism, and 1.73× the training speed and 2.32× the inference speed with a data-parallel degree of 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows a remarkably increasing strong scaling efficiency. Optimus would facilitate the scaling of language models and serve as a strong thrust in the exploration of artificial intelligence.
{"title":"An Efficient 2D Method for Training Super-Large Deep Learning Models","authors":"Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You","doi":"10.1109/IPDPS54959.2023.00031","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00031","url":null,"abstract":"Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and shown unprecedented performance in tasks like translation, classification, and text generation. However, due to the memory constraint, model parallelism must be used to split the model across multiple processors. Inter-layer partition, intra-layer partition, and sparse activation are the major approaches to achieve model parallelism. Among them, inter-layer partition [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals to the number of processors, and would introduce either gradient staleness or bubble overhead; while the sparse activation [12] is primarily designed for Google TPU cluster and hard to deploy on GPU servers, intra-layer partition [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though as pioneers of intra-layer parallelism, they still show memory redundancy and sub-optimal communication efficiency, which reveals the space for further improvements. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed along processors all the way through forward and backward propagations, with hardly any memory redundancy. The isoefficiency of communication in pure model parallelism improves from W ~ p3 for Megatron-LM, to $Wsim {(sqrt p log p)^3}$ for our Optimus. This framework is implemented with open-source deep learning framework, PyTorch, and consolidates existing techniques such as mixed precision training [13], activation checkpointing [5], and data parallelism. In experiments on TACC Frontera supercomputers, Optimus shows 1.48× the speed for training, 1.78× speed for inference, and 8× the maximum batch size over Megatron-LM on 64 GPUs in pure model parallelism; and 1.73× speed for training, 2.32× speed for inference with data parallelism size equaling 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows an extraordinary increasing strong scaling efficiency. Optimus would facilitate the scaling of language models and serve as a strong thrust in the space exploration of artificial intelligence.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121742856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exact Fault-Tolerant Consensus with Voting Validity
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00089
Zhangchen Xu, Yuetai Li, Chengli Feng, Lei Zhang
This paper investigates the multi-valued fault-tolerant distributed consensus problem that pursues exact outputs. To this end, we investigate voting validity, which requires the consensus output of non-faulty nodes to be the exact plurality of the inputs of the non-faulty nodes. Considering a specific distribution of non-faulty votes, we first give impossibility results and a tight lower bound on the system's fault tolerance for achieving agreement, termination, and voting validity. We then propose a practical consensus algorithm that satisfies voting validity in the Byzantine fault model. To ensure the exactness of outputs under any distribution of non-faulty votes, we further propose safety-critical tolerance and a corresponding protocol that prioritizes voting validity over the termination property. To refine the proposed protocols, we propose an incremental-threshold algorithm that accelerates protocol operation. We also optimize the consensus algorithms with the local broadcast model to enhance their fault tolerance.
{"title":"Exact Fault-Tolerant Consensus with Voting Validity","authors":"Zhangchen Xu, Yuetai Li, Chengli Feng, Lei Zhang","doi":"10.1109/IPDPS54959.2023.00089","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00089","url":null,"abstract":"This paper investigates the multi-valued fault-tolerant distributed consensus problem that pursues exact output. To this end, the voting validity, which requires the consensus output of non-faulty nodes to be the exact plurality of the input of non-faulty nodes, is investigated. Considering a specific distribution of non-faulty votes, we first give the impossibility results and a tight lower bound of system tolerance achieving agreement, termination and voting validity. A practical consensus algorithm that satisfies voting validity in the Byzantine fault model is proposed subsequently. To ensure the exactness of outputs in any non-faulty vote distribution, we further propose safety-critical tolerance and a corresponding protocol that prioritizes voting validity over termination property. To refine the proposed protocols, we propose an incremental threshold algorithm that accelerates protocol operation speed. We also optimize consensus algorithms with the local broadcast model to enhance the protocol’s fault tolerance ability.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123752080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Packet Processing in Container Overlay Networks via Packet-level Parallelism
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00018
Jiaxin Lei, Manish Munikar, Hui Lu, J. Rao
Overlay networks serve as the de facto network virtualization technique for providing connectivity among distributed containers. Despite the flexibility in building customized private container networks, overlay networks incur significant performance loss compared to physical networks (i.e., the native network). The culprit lies in the multiple network processing stages overlay networks add, which prolong the network processing path and overload CPU cores. In this paper, we propose mFlow, a novel packet steering approach that parallelizes the in-kernel data path of network flows. mFlow exploits packet-level parallelism in the kernel network stack by splitting the packets of the same flow into multiple micro-flows, which can be processed in parallel on multiple cores. mFlow devises new, generic mechanisms for flow splitting while preserving in-order packet delivery with little overhead. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of mFlow, with significantly improved performance: 81% higher TCP throughput and 139% higher UDP throughput compared to vanilla overlay networks. mFlow even achieved higher TCP throughput than the native network (e.g., 29.8 vs. 26.6 Gbps).
{"title":"Accelerating Packet Processing in Container Overlay Networks via Packet-level Parallelism","authors":"Jiaxin Lei, Manish Munikar, Hui Lu, J. Rao","doi":"10.1109/IPDPS54959.2023.00018","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00018","url":null,"abstract":"Overlay networks serve as the de facto network virtualization technique for providing connectivity among distributed containers. Despite the flexibility in building customized private container networks, overlay networks incur significant performance loss compared to physical networks (i.e., the native). The culprit lies in the inclusion of multiple network processing stages in overlay networks, which prolongs the network processing path and overloads CPU cores. In this paper, we propose mFlow, a novel packet steering approach to parallelize the in-kernel data path of network flows. mFlow exploits packet-level parallelism in the kernel network stack by splitting the packets of the same flow into multiple micro-flows, which can be processed in parallel on multiple cores. mFlow devises new, generic mechanisms for flow splitting while preserving in-order packet delivery with little overhead. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of mFlow, with significantly improved performance – e.g., by 81% in TCP throughput and 139% in UDP compared to vanilla overlay networks. mFlow even achieved higher TCP throughput than the native (e.g., 29.8 vs. 26.6 Gbps).","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"08 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128278937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Distribution Schemes for Dense Linear Algebra Factorizations on Any Number of Nodes
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00047
Olivier Beaumont, Jean-Alexandre Collin, Lionel Eyraud-Dubois, Mathieu Vérité
In this paper, we consider the problem of distributing the tiles of a dense matrix onto a set of homogeneous nodes. We consider both non-symmetric (LU) and symmetric (Cholesky) factorizations. The efficiency of the well-known 2D Block-Cyclic (2DBC) distribution degrades significantly if the number of nodes P cannot be written as the product of two close numbers. Similarly, the recently introduced Symmetric Block Cyclic (SBC) distribution is only valid for specific values of P. In both contexts, we propose generalizations of these distributions that adapt them to any number of nodes. We show that this improves on the existing schemes (2DBC and SBC) both in theory and in practice, using the flexibility and ease of programming afforded by task-based runtime systems like Chameleon and StarPU.
Fast Deterministic Gathering with Detection on Arbitrary Graphs: The Power of Many Robots
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00015
A. R. Molla, Kaushik Mondal, W. Moses
Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering k robots, initially placed arbitrarily on the nodes of an n-node graph, asks that these robots coordinate and communicate in a local manner, as opposed to global, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate. In this paper, we propose a deterministic approach to gathering with detection for any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) on arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that more robots are present in the system to achieve gathering with detection faster than previous papers that focused on gathering alone. The state-of-the-art solution for deterministic gathering [Ta-Shma and Zwick, TALG, 2014] takes $\tilde{O}(n^5 \log \ell)$ rounds, where $\ell$ is the smallest label among the robots and $\tilde{O}$ hides a polylog factor. We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when k ≥ ⌊n/2⌋ + 1, the algorithm takes $O(n^3)$ rounds; (ii) when k ≥ ⌊n/3⌋ + 1, the algorithm takes $O(n^4 \log n)$ rounds; and (iii) otherwise, the algorithm takes $\tilde{O}(n^5)$ rounds. The algorithm is not required to know k, but only n.
{"title":"Fast Deterministic Gathering with Detection on Arbitrary Graphs: The Power of Many Robots","authors":"A. R. Molla, Kaushik Mondal, W. Moses","doi":"10.1109/IPDPS54959.2023.00015","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00015","url":null,"abstract":"Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots, to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering k robots, initially arbitrarily placed on the nodes of an n-node graph, asks that these robots coordinate and communicate in a local manner, as opposed to global, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem to solve is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate.In this paper, we propose a deterministic approach to solve gathering with detection for any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) for arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that there are more robots present in the system to achieve gathering with detection faster than those previous papers that focused on just gathering. The state of the art solution for deterministic gathering [Ta-Shma and Zwick, TALG, 2014] takes $tilde Oleft({{n^5}log ell }right)$ rounds, where is the smallest label among robots and $tilde O$ hides a polylog factor. We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when k ≥ ⌊n/2⌋ + 1, the algorithm takes O(n3) rounds, (ii) when k ≥ ⌊n/3⌋ + 1, the algorithm takes O(n4 log n) rounds, and (iii) otherwise, the algorithm takes $tilde Oleft({{n^5}}right)$ rounds. The algorithm is not required to know k, but only n.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121089074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SW-LCM: A Scalable and Weakly-supervised Land Cover Mapping Method on a New Sunway Supercomputer
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00071
Yi Zhao, Juepeng Zheng, H. Fu, Wenzhao Wu, Jie Gao, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Runmin Dong, Z. Du, Sha Liu, Xin Liu, Shaoqing Zhang, Le Yu
High-resolution land cover mapping (LCM) is an important application for studying and understanding changes of the Earth's surface. While deep learning (DL) methods demonstrate great potential in analyzing satellite images, they largely depend on massive high-quality labels. This paper proposes SW-LCM, a scalable and weakly-supervised two-stage land cover mapping method on a new Sunway supercomputer. Our method consists of a k-means clustering module as the first stage and an iterative deep learning module as the second stage. With the k-means module providing a good enough starting point (taking its inaccurate results as noisy labels), the deep learning module improves the classification results iteratively, without any labelling effort required for processing large scenarios. To achieve efficiency for country-level land cover mapping, we design a customized data partition scheme and an on-the-fly assembly for k-means. Through careful parallelization and optimization, our k-means module scales to 98,304 computing nodes (over 38 million cores) and provides a sustained performance of 437.56 PFLOPS in a real LCM task covering the entire region of China; the iterative updating part scales to 24,576 nodes with a performance of 11 PFLOPS. We produce a 10-m resolution land cover map of China with an accuracy of 83.5% (10-class) or 73.2% (25-class), 7% to 8% higher than the best existing products, paving the way for finer land surveys to support sustainability-related applications.
{"title":"SW-LCM: A Scalable and Weakly-supervised Land Cover Mapping Method on a New Sunway Supercomputer","authors":"Yi Zhao, Juepeng Zheng, H. Fu, Wenzhao Wu, Jie Gao, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Runmin Dong, Z. Du, Sha Liu, Xin Liu, Shaoqing Zhang, Le Yu","doi":"10.1109/IPDPS54959.2023.00071","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00071","url":null,"abstract":"High-resolution land cover mapping (LCM) is an important application for studying and understanding the change of the earth surface. While deep learning (DL) methods demonstrate great potential in analyzing satellite images, they largely depend on massive high-quality labels. This paper proposes SW-LCM, a Scalable and Weakly-supervised two-stage Land Cover Mapping method on a new Sunway Supercomputer. Our method consists of a k-means clustering module as a first stage, and an iterative deep learning module as a second stage. With the k-means module providing a good enough starting point (taking inaccurate results as noisy labels), the deep learning module improves the classification results in an iterative way, without any labelling efforts required for processing large scenarios. To achieve efficiency for country-level land cover mapping, we design a customized data partition scheme and an on-the-fly assembly for k-means. Through careful parallelization and optimization, our k-means module scales to 98,304 computing nodes (over 38 million cores), and provides a sustained performance of 437.56 PFLOPS, in a real LCM task of the entire region of China; the iterative updating part scales to 24,576 nodes, with a performance of 11 PFLOPS. We produce a 10-m resolution land cover map of China, with an accuracy of 83.5% (10-class) or 73.2% (25-class), 7% to 8% higher than best existing products, paving ways for finer land surveys to support sustainability-related applications.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116337423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Triangular Space-Filling Curve for Cache-Oblivious In-Place Transposition of Square Matrices
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00045
J. N. F. Alves, L. Russo, Alexandre P. Francisco, S. Benkner
This paper proposes a novel cache-oblivious blocking scheme based on a new triangular space-filling curve that preserves data locality. The proposed blocking scheme reduces the movement of data within the host memory hierarchy for triangular matrix traversals, which inherently exhibit poor data locality, such as the in-place transposition of square matrices. We show that our cache-oblivious blocking scheme can be generated iteratively in linear time and constant memory with respect to the number of entries in the lower, or upper, triangle of the input matrix. In contrast to classical recursive cache-oblivious solutions, the iterative nature of our blocking scheme does not inhibit other essential optimizations such as software prefetching. To assess the viability of our blocking scheme as a cache-oblivious strategy, we applied it to the in-place transposition of square matrices. Extensive experiments show that our cache-oblivious transposition algorithm generally outperforms the cache-aware state-of-the-art algorithm in terms of throughput and energy efficiency in both sequential and parallel environments.
{"title":"A Novel Triangular Space-Filling Curve for Cache-Oblivious In-Place Transposition of Square Matrices","authors":"J. N. F. Alves, L. Russo, Alexandre P. Francisco, S. Benkner","doi":"10.1109/IPDPS54959.2023.00045","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00045","url":null,"abstract":"This paper proposes a novel cache-oblivious blocking scheme based on a new triangular space-filling curve which preserves data locality. The proposed blocking-scheme reduces the movement of data within the host memory hierarchy for triangular matrix traversals, which inherently exhibit poor data locality, such as the in-place transposition of square matrices. We show that our cache-oblivious blocking-scheme can be generated iteratively in linear time and constant memory with regard to the number of entries present in the lower, or upper, triangle of the input matrix. In contrast to classical recursive cache-oblivious solutions, the iterative nature of our blocking-scheme does not inhibit other essential optimizations such as software prefetching. In order to assess the viability of our blocking-scheme as a cache-oblivious strategy, we applied it to the in-place transposition of square matrices. Extensive experiments show that our cache-oblivious transposition algorithm generally outperforms the cache-aware state-of-the-art algorithm in terms of throughput and energy efficiency in sequential as well as parallel environments.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127706918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LowFive: In Situ Data Transport for High-Performance Workflows
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00102
T. Peterka, D. Morozov, Arnur Nigmetov, Orcun Yildiz, Bogdan Nicolae, Philip E. Davis
We describe LowFive, a new data transport layer for in situ workflows based on the HDF5 data model. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), read and write traditional HDF5 files on physical storage, or combine the two modes. Minimal, and often no, source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate these features in a series of experiments featuring both synthetic benchmarks and a representative use case from a scientific workflow, and we compare with other data transport solutions in the literature.
{"title":"LowFive: In Situ Data Transport for High-Performance Workflows","authors":"T. Peterka, D. Morozov, Arnur Nigmetov, Orcun Yildiz, Bogdan Nicolae, Philip E. Davis","doi":"10.1109/IPDPS54959.2023.00102","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00102","url":null,"abstract":"We describe LowFive, a new data transport layer based on the HDF5 data model, for in situ workflows. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), reading and writing traditional HDF5 files to physical storage, and combining the two modes. Minimal and often no source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate the above features in a series of experiments featuring both synthetic benchmarks as well as a representative use case from a scientific workflow, and we also compare with other data transport solutions in the literature.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115625635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}