GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00081
Milan Shah, Xiaodong Yu, S. Di, Danylo Lykov, Y. Alexeev, M. Becchi, F. Cappello
Quantum circuit simulations enable researchers to develop quantum algorithms without the need for a physical quantum computer. Quantum computing simulators, however, all suffer from significant memory footprint requirements, which prevents large circuits from being simulated on classical supercomputers. In this paper, we explore different lossy compression strategies to substantially shrink quantum circuit tensors in the QTensor package (a state-of-the-art tensor network quantum circuit simulator) while ensuring the reconstructed data satisfy the user-specified fidelity. Our contribution is fourfold. (1) We propose a series of optimized pre- and post-processing steps that boost the compression ratio of tensors with very limited performance overhead. (2) We characterize the impact of lossy decompressed data on quantum circuit simulation results, and leverage this analysis to ensure the fidelity of the reconstructed data. (3) We propose a configurable GPU compression framework based on cuSZ and cuSZx, two state-of-the-art GPU-accelerated lossy compressors, to address different use cases: prioritizing either compression ratio or compression speed. (4) We perform a comprehensive evaluation by running 9 state-of-the-art compressors on an NVIDIA A100 GPU over QTensor-generated tensors of varying sizes. When prioritizing compression ratio, our results show that our strategies can increase the compression ratio nearly 10× compared to using only cuSZ. When prioritizing throughput, we can compress at a speed comparable to cuSZx while achieving 3-4× higher compression ratios. Decompressed tensors can be used in QTensor circuit simulation to yield a final energy result within 1-5% of the true energy value.
{"title":"GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations","authors":"Milan Shah, Xiaodong Yu, S. Di, Danylo Lykov, Y. Alexeev, M. Becchi, F. Cappello","doi":"10.1109/IPDPS54959.2023.00081","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00081","url":null,"abstract":"Quantum circuit simulations enable researchers to develop quantum algorithms without the need for a physical quantum computer. Quantum computing simulators, however, all suffer from significant memory footprint requirements, which prevents large circuits from being simulated on classical super-computers. In this paper, we explore different lossy compression strategies to substantially shrink quantum circuit tensors in the QTensor package (a state-of-the-art tensor network quantum circuit simulator) while ensuring the reconstructed data satisfy the user-needed fidelity.Our contribution is fourfold. (1) We propose a series of optimized pre- and post-processing steps to boost the compression ratio of tensors with a very limited performance overhead. (2) We characterize the impact of lossy decompressed data on quantum circuit simulation results, and leverage the analysis to ensure the fidelity of reconstructed data. (3) We propose a configurable compression framework for GPU based on cuSZ and cuSZx, two state-of-the-art GPU-accelerated lossy compressors, to address different use-cases: either prioritizing compression ratios or prioritizing compression speed. (4) We perform a comprehensive evaluation by running 9 state-of-the-art compressors on an NVIDIA A100 GPU based on QTensor-generated tensors of varying sizes. When prioritizing compression ratio, our results show that our strategies can increase the compression ratio nearly 10 times compared to using only cuSZ. When prioritizing throughput, we can perform compression at the comparable speed as cuSZx while achieving 3-4× higher compression ratios. Decompressed tensors can be used in QTensor circuit simulation to yield a final energy result within 1-5% of the true energy value.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"16 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113990403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Arithmetic Intensity of Distributed-Memory Dense Matrix Multiplication Involving a Symmetric Input Matrix (SYMM)
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00044
E. Agullo, A. Buttari, O. Coulaud, Lionel Eyraud-Dubois, Mathieu Faverge, Alain Franc, A. Guermouche, Antoine Jego, Romain Peressoni, Florent Pruvost
Dense matrix multiplication involving a symmetric input matrix (SYMM) is implemented in reference distributed-memory codes with the same data distribution as its general analogue (GEMM). We show that, when the symmetric matrix is dominant, such a 2D block-cyclic (2D BC) scheme leads to an arithmetic intensity (AI) for SYMM that is a factor of 2 lower than that of GEMM. We propose alternative data distributions that preserve SYMM's memory benefit of storing only half of the matrix while achieving up to the same AI as GEMM. We also show that, when we can afford the same memory footprint as GEMM, SYMM can achieve a higher AI. We propose a task-based design of SYMM that is independent of the data distribution. This design allows for a scalable A-stationary SYMM with which all the discussed data distributions, even very irregular ones, can be easily assessed. We have integrated the resulting code in a dimension reduction algorithm involving a randomized singular value decomposition dominated by SYMM. An experimental study shows a compelling impact on performance.
{"title":"On the Arithmetic Intensity of Distributed-Memory Dense Matrix Multiplication Involving a Symmetric Input Matrix (SYMM)","authors":"E. Agullo, A. Buttari, O. Coulaud, Lionel Eyraud-Dubois, Mathieu Faverge, Alain Franc, A. Guermouche, Antoine Jego, Romain Peressoni, Florent Pruvost","doi":"10.1109/IPDPS54959.2023.00044","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00044","url":null,"abstract":"Dense matrix multiplication involving a symmetric input matrix (SYMM) is implemented in reference distributed-memory codes with the same data distribution as its general analogue (GEMM). We show that, when the symmetric matrix is dominant, such a 2D block-cyclic (2D BC) scheme leads to a lower arithmetic intensity (AI) of SYMM than that of GEMM by a factor of 2. We propose alternative data distributions preserving the memory benefit of SYMM of storing only half of the matrix while achieving up to the same AI as GEMM. We also show that, in the case we can afford the same memory footprint as GEMM, SYMM can achieve a higher AI. We propose a task-based design of SYMM independent of the data distribution. This design allows for scalable A-stationary SYMM with which all discussed data distributions, may they be very irregular, can be easily assessed. We have integrated the resulting code in a reduction dimension algorithm involving a randomized singular value decomposition dominated by SYMM. An experimental study shows a compelling impact on performance.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123635355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient 2D Method for Training Super-Large Deep Learning Models
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00031
Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You
Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and have shown unprecedented performance in tasks like translation, classification, and text generation. However, due to memory constraints, model parallelism must be used to split the model across multiple processors. Inter-layer partitioning, intra-layer partitioning, and sparse activation are the major approaches to model parallelism. Among them, inter-layer partitioning [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals the number of processors, and introduces either gradient staleness or bubble overhead, while sparse activation [12] is primarily designed for Google TPU clusters and is hard to deploy on GPU servers. Intra-layer partitioning [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though pioneers of intra-layer parallelism, these approaches still exhibit memory redundancy and sub-optimal communication efficiency, which leaves room for further improvement. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed across processors all the way through the forward and backward propagations, with hardly any memory redundancy. The isoefficiency function of communication in pure model parallelism improves from $W \sim p^3$ for Megatron-LM to $W \sim (\sqrt{p}\log p)^3$ for our Optimus. The framework is implemented with the open-source deep learning framework PyTorch, and consolidates existing techniques such as mixed-precision training [13], activation checkpointing [5], and data parallelism. In experiments on the TACC Frontera supercomputer, Optimus shows 1.48× the training speed, 1.78× the inference speed, and 8× the maximum batch size of Megatron-LM on 64 GPUs in pure model parallelism, and 1.73× the training speed and 2.32× the inference speed with a data-parallel degree of 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows a remarkably increasing strong scaling efficiency. Optimus would facilitate the scaling of language models and serve as a strong thrust in the exploration of artificial intelligence.
{"title":"An Efficient 2D Method for Training Super-Large Deep Learning Models","authors":"Qifan Xu, Shenggui Li, Chaoyu Gong, Yang You","doi":"10.1109/IPDPS54959.2023.00031","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00031","url":null,"abstract":"Since the rise of Transformer [22] and BERT [6], large language models [7], [12] have been proposed and shown unprecedented performance in tasks like translation, classification, and text generation. However, due to the memory constraint, model parallelism must be used to split the model across multiple processors. Inter-layer partition, intra-layer partition, and sparse activation are the major approaches to achieve model parallelism. Among them, inter-layer partition [10], [11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals to the number of processors, and would introduce either gradient staleness or bubble overhead; while the sparse activation [12] is primarily designed for Google TPU cluster and hard to deploy on GPU servers, intra-layer partition [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though as pioneers of intra-layer parallelism, they still show memory redundancy and sub-optimal communication efficiency, which reveals the space for further improvements. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed along processors all the way through forward and backward propagations, with hardly any memory redundancy. The isoefficiency of communication in pure model parallelism improves from W ~ p3 for Megatron-LM, to $Wsim {(sqrt p log p)^3}$ for our Optimus. This framework is implemented with open-source deep learning framework, PyTorch, and consolidates existing techniques such as mixed precision training [13], activation checkpointing [5], and data parallelism. In experiments on TACC Frontera supercomputers, Optimus shows 1.48× the speed for training, 1.78× speed for inference, and 8× the maximum batch size over Megatron-LM on 64 GPUs in pure model parallelism; and 1.73× speed for training, 2.32× speed for inference with data parallelism size equaling 2 on 128 GPUs. In pure model parallelism, Optimus surpasses Megatron-LM in weak scaling efficiency by a great margin, and shows an extraordinary increasing strong scaling efficiency. Optimus would facilitate the scaling of language models and serve as a strong thrust in the space exploration of artificial intelligence.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121742856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exact Fault-Tolerant Consensus with Voting Validity
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00089
Zhangchen Xu, Yuetai Li, Chengli Feng, Lei Zhang
This paper investigates the multi-valued fault-tolerant distributed consensus problem that pursues exact outputs. To this end, we investigate voting validity, which requires the consensus output of non-faulty nodes to be the exact plurality of the inputs of the non-faulty nodes. Considering a specific distribution of non-faulty votes, we first give impossibility results and a tight lower bound on the system's fault tolerance for achieving agreement, termination, and voting validity. We then propose a practical consensus algorithm that satisfies voting validity in the Byzantine fault model. To ensure the exactness of outputs under any distribution of non-faulty votes, we further propose safety-critical tolerance and a corresponding protocol that prioritizes voting validity over the termination property. To refine the proposed protocols, we propose an incremental-threshold algorithm that accelerates protocol operation. We also optimize the consensus algorithms with the local broadcast model to enhance their fault tolerance.
{"title":"Exact Fault-Tolerant Consensus with Voting Validity","authors":"Zhangchen Xu, Yuetai Li, Chengli Feng, Lei Zhang","doi":"10.1109/IPDPS54959.2023.00089","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00089","url":null,"abstract":"This paper investigates the multi-valued fault-tolerant distributed consensus problem that pursues exact output. To this end, the voting validity, which requires the consensus output of non-faulty nodes to be the exact plurality of the input of non-faulty nodes, is investigated. Considering a specific distribution of non-faulty votes, we first give the impossibility results and a tight lower bound of system tolerance achieving agreement, termination and voting validity. A practical consensus algorithm that satisfies voting validity in the Byzantine fault model is proposed subsequently. To ensure the exactness of outputs in any non-faulty vote distribution, we further propose safety-critical tolerance and a corresponding protocol that prioritizes voting validity over termination property. To refine the proposed protocols, we propose an incremental threshold algorithm that accelerates protocol operation speed. We also optimize consensus algorithms with the local broadcast model to enhance the protocol’s fault tolerance ability.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123752080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Packet Processing in Container Overlay Networks via Packet-level Parallelism
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00018
Jiaxin Lei, Manish Munikar, Hui Lu, J. Rao
Overlay networks serve as the de facto network virtualization technique for providing connectivity among distributed containers. Despite the flexibility in building customized private container networks, overlay networks incur significant performance loss compared to physical networks (i.e., the native network). The culprit lies in the multiple network processing stages overlay networks add, which prolong the network processing path and overload CPU cores. In this paper, we propose mFlow, a novel packet steering approach that parallelizes the in-kernel data path of network flows. mFlow exploits packet-level parallelism in the kernel network stack by splitting the packets of the same flow into multiple micro-flows, which can be processed in parallel on multiple cores. mFlow devises new, generic mechanisms for flow splitting while preserving in-order packet delivery with little overhead. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of mFlow, with significantly improved performance: 81% higher TCP throughput and 139% higher UDP throughput compared to vanilla overlay networks. mFlow even achieved higher TCP throughput than the native network (e.g., 29.8 vs. 26.6 Gbps).
{"title":"Accelerating Packet Processing in Container Overlay Networks via Packet-level Parallelism","authors":"Jiaxin Lei, Manish Munikar, Hui Lu, J. Rao","doi":"10.1109/IPDPS54959.2023.00018","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00018","url":null,"abstract":"Overlay networks serve as the de facto network virtualization technique for providing connectivity among distributed containers. Despite the flexibility in building customized private container networks, overlay networks incur significant performance loss compared to physical networks (i.e., the native). The culprit lies in the inclusion of multiple network processing stages in overlay networks, which prolongs the network processing path and overloads CPU cores. In this paper, we propose mFlow, a novel packet steering approach to parallelize the in-kernel data path of network flows. mFlow exploits packet-level parallelism in the kernel network stack by splitting the packets of the same flow into multiple micro-flows, which can be processed in parallel on multiple cores. mFlow devises new, generic mechanisms for flow splitting while preserving in-order packet delivery with little overhead. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of mFlow, with significantly improved performance – e.g., by 81% in TCP throughput and 139% in UDP compared to vanilla overlay networks. mFlow even achieved higher TCP throughput than the native (e.g., 29.8 vs. 26.6 Gbps).","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"08 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128278937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Distribution Schemes for Dense Linear Algebra Factorizations on Any Number of Nodes
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00047
Olivier Beaumont, Jean-Alexandre Collin, Lionel Eyraud-Dubois, Mathieu Vérité
In this paper, we consider the problem of distributing the tiles of a dense matrix onto a set of homogeneous nodes. We consider both non-symmetric (LU) and symmetric (Cholesky) factorizations. The efficiency of the well-known 2D Block-Cyclic (2DBC) distribution degrades significantly if the number of nodes P cannot be written as the product of two close numbers. Similarly, the recently introduced Symmetric Block Cyclic (SBC) distribution is only valid for specific values of P. In both contexts, we propose generalizations of these distributions that adapt them to any number of nodes. We show that this improves on the existing schemes (2DBC and SBC) both in theory and in practice, using the flexibility and ease of programming afforded by task-based runtime systems like Chameleon and StarPU.
Fast Deterministic Gathering with Detection on Arbitrary Graphs: The Power of Many Robots
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00015
A. R. Molla, Kaushik Mondal, W. Moses
Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering k robots, initially placed arbitrarily on the nodes of an n-node graph, asks that these robots coordinate and communicate in a local manner, as opposed to global, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate. In this paper, we propose a deterministic approach to gathering with detection for any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) on arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that more robots are present in the system to achieve gathering with detection faster than previous papers that focused on gathering alone. The state-of-the-art solution for deterministic gathering [Ta-Shma and Zwick, TALG, 2014] takes $\tilde{O}(n^5 \log \ell)$ rounds, where $\ell$ is the smallest label among the robots and $\tilde{O}$ hides a polylog factor. We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when k ≥ ⌊n/2⌋ + 1, the algorithm takes $O(n^3)$ rounds; (ii) when k ≥ ⌊n/3⌋ + 1, the algorithm takes $O(n^4 \log n)$ rounds; and (iii) otherwise, the algorithm takes $\tilde{O}(n^5)$ rounds. The algorithm is not required to know k, but only n.
{"title":"Fast Deterministic Gathering with Detection on Arbitrary Graphs: The Power of Many Robots","authors":"A. R. Molla, Kaushik Mondal, W. Moses","doi":"10.1109/IPDPS54959.2023.00015","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00015","url":null,"abstract":"Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots, to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering k robots, initially arbitrarily placed on the nodes of an n-node graph, asks that these robots coordinate and communicate in a local manner, as opposed to global, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem to solve is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate.In this paper, we propose a deterministic approach to solve gathering with detection for any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) for arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that there are more robots present in the system to achieve gathering with detection faster than those previous papers that focused on just gathering. The state of the art solution for deterministic gathering [Ta-Shma and Zwick, TALG, 2014] takes $tilde Oleft({{n^5}log ell }right)$ rounds, where is the smallest label among robots and $tilde O$ hides a polylog factor. We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when k ≥ ⌊n/2⌋ + 1, the algorithm takes O(n3) rounds, (ii) when k ≥ ⌊n/3⌋ + 1, the algorithm takes O(n4 log n) rounds, and (iii) otherwise, the algorithm takes $tilde Oleft({{n^5}}right)$ rounds. The algorithm is not required to know k, but only n.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121089074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SW-LCM: A Scalable and Weakly-supervised Land Cover Mapping Method on a New Sunway Supercomputer
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00071
Yi Zhao, Juepeng Zheng, H. Fu, Wenzhao Wu, Jie Gao, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Runmin Dong, Z. Du, Sha Liu, Xin Liu, Shaoqing Zhang, Le Yu
High-resolution land cover mapping (LCM) is an important application for studying and understanding changes of the Earth's surface. While deep learning (DL) methods demonstrate great potential in analyzing satellite images, they largely depend on massive high-quality labels. This paper proposes SW-LCM, a scalable and weakly-supervised two-stage land cover mapping method on a new Sunway supercomputer. Our method consists of a k-means clustering module as the first stage and an iterative deep learning module as the second stage. With the k-means module providing a good enough starting point (taking its inaccurate results as noisy labels), the deep learning module improves the classification results iteratively, without any labelling effort required for processing large scenarios. To achieve efficiency for country-level land cover mapping, we design a customized data partition scheme and an on-the-fly assembly for k-means. Through careful parallelization and optimization, our k-means module scales to 98,304 computing nodes (over 38 million cores) and provides a sustained performance of 437.56 PFLOPS in a real LCM task covering the entire region of China; the iterative updating part scales to 24,576 nodes with a performance of 11 PFLOPS. We produce a 10-m resolution land cover map of China with an accuracy of 83.5% (10-class) or 73.2% (25-class), 7% to 8% higher than the best existing products, paving the way for finer land surveys to support sustainability-related applications.
{"title":"SW-LCM: A Scalable and Weakly-supervised Land Cover Mapping Method on a New Sunway Supercomputer","authors":"Yi Zhao, Juepeng Zheng, H. Fu, Wenzhao Wu, Jie Gao, Mengxuan Chen, Jinxiao Zhang, Lixian Zhang, Runmin Dong, Z. Du, Sha Liu, Xin Liu, Shaoqing Zhang, Le Yu","doi":"10.1109/IPDPS54959.2023.00071","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00071","url":null,"abstract":"High-resolution land cover mapping (LCM) is an important application for studying and understanding the change of the earth surface. While deep learning (DL) methods demonstrate great potential in analyzing satellite images, they largely depend on massive high-quality labels. This paper proposes SW-LCM, a Scalable and Weakly-supervised two-stage Land Cover Mapping method on a new Sunway Supercomputer. Our method consists of a k-means clustering module as a first stage, and an iterative deep learning module as a second stage. With the k-means module providing a good enough starting point (taking inaccurate results as noisy labels), the deep learning module improves the classification results in an iterative way, without any labelling efforts required for processing large scenarios. To achieve efficiency for country-level land cover mapping, we design a customized data partition scheme and an on-the-fly assembly for k-means. Through careful parallelization and optimization, our k-means module scales to 98,304 computing nodes (over 38 million cores), and provides a sustained performance of 437.56 PFLOPS, in a real LCM task of the entire region of China; the iterative updating part scales to 24,576 nodes, with a performance of 11 PFLOPS. We produce a 10-m resolution land cover map of China, with an accuracy of 83.5% (10-class) or 73.2% (25-class), 7% to 8% higher than best existing products, paving ways for finer land surveys to support sustainability-related applications.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116337423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Triangular Space-Filling Curve for Cache-Oblivious In-Place Transposition of Square Matrices
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00045
J. N. F. Alves, L. Russo, Alexandre P. Francisco, S. Benkner
This paper proposes a novel cache-oblivious blocking scheme based on a new triangular space-filling curve that preserves data locality. The proposed blocking scheme reduces the movement of data within the host memory hierarchy for triangular matrix traversals, which inherently exhibit poor data locality, such as the in-place transposition of square matrices. We show that our cache-oblivious blocking scheme can be generated iteratively in linear time and constant memory with respect to the number of entries in the lower, or upper, triangle of the input matrix. In contrast to classical recursive cache-oblivious solutions, the iterative nature of our blocking scheme does not inhibit other essential optimizations such as software prefetching. To assess the viability of our blocking scheme as a cache-oblivious strategy, we applied it to the in-place transposition of square matrices. Extensive experiments show that our cache-oblivious transposition algorithm generally outperforms the cache-aware state-of-the-art algorithm in terms of throughput and energy efficiency in both sequential and parallel environments.
{"title":"A Novel Triangular Space-Filling Curve for Cache-Oblivious In-Place Transposition of Square Matrices","authors":"J. N. F. Alves, L. Russo, Alexandre P. Francisco, S. Benkner","doi":"10.1109/IPDPS54959.2023.00045","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00045","url":null,"abstract":"This paper proposes a novel cache-oblivious blocking scheme based on a new triangular space-filling curve which preserves data locality. The proposed blocking-scheme reduces the movement of data within the host memory hierarchy for triangular matrix traversals, which inherently exhibit poor data locality, such as the in-place transposition of square matrices. We show that our cache-oblivious blocking-scheme can be generated iteratively in linear time and constant memory with regard to the number of entries present in the lower, or upper, triangle of the input matrix. In contrast to classical recursive cache-oblivious solutions, the iterative nature of our blocking-scheme does not inhibit other essential optimizations such as software prefetching. In order to assess the viability of our blocking-scheme as a cache-oblivious strategy, we applied it to the in-place transposition of square matrices. Extensive experiments show that our cache-oblivious transposition algorithm generally outperforms the cache-aware state-of-the-art algorithm in terms of throughput and energy efficiency in sequential as well as parallel environments.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127706918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LowFive: In Situ Data Transport for High-Performance Workflows
Pub Date : 2023-05-01. DOI: 10.1109/IPDPS54959.2023.00102
T. Peterka, D. Morozov, Arnur Nigmetov, Orcun Yildiz, Bogdan Nicolae, Philip E. Davis
We describe LowFive, a new data transport layer for in situ workflows based on the HDF5 data model. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), read and write traditional HDF5 files on physical storage, or combine the two modes. Minimal, and often no, source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate these features in a series of experiments featuring both synthetic benchmarks and a representative use case from a scientific workflow, and we compare with other data transport solutions in the literature.
{"title":"LowFive: In Situ Data Transport for High-Performance Workflows","authors":"T. Peterka, D. Morozov, Arnur Nigmetov, Orcun Yildiz, Bogdan Nicolae, Philip E. Davis","doi":"10.1109/IPDPS54959.2023.00102","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00102","url":null,"abstract":"We describe LowFive, a new data transport layer based on the HDF5 data model, for in situ workflows. Executables using LowFive can communicate in situ (using in-memory data and MPI message passing), reading and writing traditional HDF5 files to physical storage, and combining the two modes. Minimal and often no source-code modification is needed for programs that already use HDF5. LowFive maintains deep copies or shallow references of datasets, configurable by the user. More than one task can produce (write) data, and more than one task can consume (read) data, accommodating fan-in and fan-out in the workflow task graph. LowFive supports data redistribution from n producer processes to m consumer processes. We demonstrate the above features in a series of experiments featuring both synthetic benchmarks as well as a representative use case from a scientific workflow, and we also compare with other data transport solutions in the literature.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115625635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}