
ACM Transactions on Parallel Computing: Latest Articles

Joinable Parallel Balanced Binary Trees
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-04-11 DOI: 10.1145/3512769
G. Blelloch, Daniel Ferizovic, Yihan Sun
In this article, we show how a single function, join, can be used to implement parallel balanced binary search trees (BSTs) simply and efficiently. Based on join, our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework called join-based algorithms. We show that the join function fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to as joinable trees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, including insert, delete, union, intersection, difference, split, range, filter, and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operations union, intersection, and difference have work \(O(m\log(\frac{n}{m}+1))\) and polylogarithmic span for input set sizes \(n\) and \(m \le n\). We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around \(73\times\) speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.
Journal: ACM Transactions on Parallel Computing. Volume: 9 1. Pages: 1-41. Publication type: Journal Article.
Citations: 2
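The join-based idea summarized in the Joinable Parallel Balanced Binary Trees abstract can be illustrated with a small sequential sketch. It uses treaps, one of the four joinable schemes the abstract names, and shows how split, insert, and union can be written purely in terms of join. The `Node` class, the function names, and the in-place (destructive) node reuse are assumptions made for this illustration and are not taken from the paper; the two recursive calls in `union` are independent, which is where a parallel version would fork.

```python
import random

class Node:
    """Treap node; `prio` is a random heap priority (max-heap on prio)."""
    __slots__ = ("key", "prio", "left", "right")
    def __init__(self, key):
        self.key = key
        self.prio = random.random()
        self.left = None
        self.right = None

def join(left, mid, right):
    """Combine trees where all keys in `left` < mid.key < all keys in `right`.
    Reuses nodes in place, so the inputs must not be used afterwards."""
    lp = left.prio if left else -1.0
    rp = right.prio if right else -1.0
    if mid.prio >= lp and mid.prio >= rp:      # mid can be the root
        mid.left, mid.right = left, right
        return mid
    if lp >= rp:                               # root of `left` stays on top
        left.right = join(left.right, mid, right)
        return left
    right.left = join(left, mid, right.left)   # root of `right` stays on top
    return right

def split(t, key):
    """Return (tree of keys < key, whether key was present, tree of keys > key)."""
    if t is None:
        return None, False, None
    if key == t.key:
        return t.left, True, t.right
    if key < t.key:
        ll, found, lr = split(t.left, key)
        return ll, found, join(lr, t, t.right)
    rl, found, rr = split(t.right, key)
    return join(t.left, t, rl), found, rr

def insert(t, key):
    """Insert via split + join; an existing copy of `key` is replaced."""
    l, _, r = split(t, key)
    return join(l, Node(key), r)

def union(a, b):
    """Set union; the two recursive calls are independent of each other."""
    if a is None:
        return b
    if b is None:
        return a
    l, _, r = split(a, b.key)
    bl, br = b.left, b.right
    new_left = union(l, bl)     # could run in parallel with the next call
    new_right = union(r, br)
    return join(new_left, b, new_right)

def to_list(t):
    """In-order traversal (iterative), yielding keys in sorted order."""
    out, stack, node = [], [], t
    while stack or node:
        while node:
            stack.append(node)
            node = node.left
        node = stack.pop()
        out.append(node.key)
        node = node.right
    return out

a = b = None
for k in [1, 3, 5, 7, 9]:
    a = insert(a, k)
for k in [2, 3, 4, 9, 10]:
    b = insert(b, k)
print(to_list(union(a, b)))  # [1, 2, 3, 4, 5, 7, 9, 10]
```

Only `join` knows about the balancing scheme here; swapping treaps for another joinable scheme would, under the framework the abstract describes, mean replacing `join` while leaving `split`, `insert`, and `union` unchanged.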
fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-04-11 DOI: 10.1145/3512770
Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya
Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on the Sunway TaihuLight supercomputer to alleviate these challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI+OpenMP+X parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both pre-/post-processing and the main SpMSpV computation. Second, fgSpMSpV utilizes adaptive parallel execution to reduce the pre-processing and adapt to the parallelism and memory hierarchy of the Sunway system, while still taming redundant and random memory accesses in SpMSpV; this includes a set of techniques such as the fine-grained partitioner, the re-collection method, and the Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization techniques across inputs of various sparsity. Additionally, fgSpMSpV is implemented on an NVIDIA Tesla P100 GPU and applied to the breadth-first search (BFS) application. fgSpMSpV on a P100 GPU obtains a speedup of up to \(134.38\times\) over the state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves a speedup of up to \(21.68\times\) over the state of the art.
Journal: ACM Transactions on Parallel Computing. Volume: 9 1. Pages: 1-29. Publication type: Journal Article.
Citations: 5
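For readers unfamiliar with the SpMSpV operation that the fgSpMSpV abstract targets, the following is a minimal sequential sketch of the textbook column-wise formulation on a standard compressed sparse column (CSC) matrix. It is not the CSCV format or the partitioning scheme of the paper; the function name and argument layout are assumptions chosen for illustration. The point it shows is that only the columns matching nonzeros of the input vector are visited, which is also the source of the irregular, data-dependent access pattern the abstract mentions.

```python
def spmspv_csc(col_ptr, row_idx, vals, x_idx, x_vals):
    """Compute y = A @ x for a CSC matrix A and a sparse vector x.

    col_ptr, row_idx, vals : standard CSC arrays of A
    x_idx, x_vals          : indices and values of the nonzeros of x
    Returns (y_idx, y_vals), the nonzero pattern of y in row order.
    """
    acc = {}  # sparse accumulator: row index -> partial sum
    for j, xj in zip(x_idx, x_vals):
        # Only column j of A contributes for the nonzero x[j].
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            acc[i] = acc.get(i, 0.0) + vals[p] * xj
    y_idx = sorted(acc)
    return y_idx, [acc[i] for i in y_idx]

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
vals = [1.0, 4.0, 3.0, 2.0, 5.0]

# x has nonzeros x[0] = 2 and x[2] = 1; column 1 of A is never touched.
print(spmspv_csc(col_ptr, row_idx, vals, [0, 2], [2.0, 1.0]))
# ([0, 2], [4.0, 13.0])
```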
Fast Concurrent Data Sketches
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-04-11 DOI: 10.1145/3512758
Arik Rinberg, A. Spiegelman, Edward Bortnikov, Eshcar Hillel, I. Keidar, Lee Rhodes, Hadar Serviansky
Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Utilising relaxed semantics and the notion of strong linearisability, we prove our algorithm’s correctness and analyse the error it induces in some specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.
Journal: ACM Transactions on Parallel Computing. Volume: 9 1. Pages: 1-35. Publication type: Journal Article.
Citations: 1
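The trade-off described in the Fast Concurrent Data Sketches abstract, where queries may run while updates are in flight and the resulting error stays bounded, can be illustrated with a deliberately simplified sketch. The code below is an assumption-laden illustration and not the authors' algorithm: it uses a lock where the paper relies on relaxed semantics and strong linearisability, and the class names, the K-minimum-values distinct-count sketch, and the per-thread buffer size are all choices made for this example. What it shows is only that a query can miss at most `buffer_size` pending updates per thread.

```python
import hashlib
import threading

def _hash01(item):
    """Deterministic hash of an item to a float in [0, 1)."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class KMVSketch:
    """K-minimum-values sketch for approximate distinct counting."""
    def __init__(self, k):
        self.k = k
        self.values = set()  # the k smallest hash values seen so far

    def update(self, item):
        self.values.add(_hash01(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))   # exact while under capacity
        return (self.k - 1) / max(self.values)

class BufferedConcurrentSketch:
    """Threads buffer updates locally and merge them in bulk."""
    def __init__(self, k, buffer_size):
        self._global = KMVSketch(k)
        self._lock = threading.Lock()
        self._local = threading.local()
        self._buffer_size = buffer_size

    def update(self, item):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = []
        buf.append(item)
        if len(buf) >= self._buffer_size:
            self.flush()

    def flush(self):
        """Merge the calling thread's buffer into the shared sketch."""
        buf = getattr(self._local, "buf", None)
        if buf:
            with self._lock:
                for item in buf:
                    self._global.update(item)
            buf.clear()

    def query(self):
        """May lag behind by up to buffer_size updates per thread."""
        with self._lock:
            return self._global.estimate()

sketch = BufferedConcurrentSketch(k=1024, buffer_size=8)
for i in range(5):
    sketch.update(i)
print(sketch.query())  # 0.0 (five updates still buffered)
sketch.flush()
print(sketch.query())  # 5.0
```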
Chronic giant cranial diploe hematoma in hemophiliac.
IF 1 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-03-30 DOI: 10.1055/a-1813-0090
Weizhao Gong, Hanshi Wang, Taipeng Jiang, Dahui Zuo

Cranial diploe hematoma is a hematoma that occurs between the inner and outer layers of the skull and is often seen in infants and young children. Hemophilia A is an inherited X-linked bleeding disorder caused by a deficiency of coagulation factor VIII (FVIII). Epidemiological survey results show that the prevalence of hemophilia in 24 provinces and cities in China is 2.73/100,000, while only about 5% of patients are registered. Hemophilia is mainly characterized by bleeding, which can occur anywhere in the patient's body and manifest as intracranial, gastrointestinal, or pharyngeal bleeding, which can be life-threatening in severe cases. This article shares a case of a patient with hemophilia A complicated by a chronic giant diploe hematoma.

Journal: ACM Transactions on Parallel Computing. Volume: 10 1. Publication type: Journal Article.
Citations: 0
BQ: A Lock-Free Queue with Batching
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-03-24 DOI: 10.1145/3512757
Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank
Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.
Journal: ACM Transactions on Parallel Computing. Volume: 9 1. Pages: 1-49. Publication type: Journal Article.
Citations: 1
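The batching interface described in the BQ abstract, in which a sequence of operations is recorded first and applied to the shared queue later in one step, can be sketched as follows. This is a lock-based illustration of the semantics only and is explicitly not lock-free, so it does not reflect the paper's algorithm or its performance. The names `BatchingQueue`, `Batch`, `future_enqueue`, `future_dequeue`, and `execute` are hypothetical and introduced for this example.

```python
import threading
from collections import deque

class BatchingQueue:
    """FIFO queue whose operations can also be submitted in batches."""
    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def enqueue(self, value):
        with self._lock:
            self._items.append(value)

    def dequeue(self):
        """Return the oldest item, or None if the queue is empty."""
        with self._lock:
            return self._items.popleft() if self._items else None

    def new_batch(self):
        return Batch(self)

class Batch:
    """Records deferred operations; nothing touches the queue until execute()."""
    def __init__(self, queue):
        self._queue = queue
        self._ops = []

    def future_enqueue(self, value):
        self._ops.append(("enq", value))

    def future_dequeue(self):
        # The returned slot is filled in when the batch is executed.
        slot = {"done": False, "value": None}
        self._ops.append(("deq", slot))
        return slot

    def execute(self):
        """Apply all recorded operations in order with one lock acquisition."""
        with self._queue._lock:
            items = self._queue._items
            for kind, arg in self._ops:
                if kind == "enq":
                    items.append(arg)
                else:
                    arg["value"] = items.popleft() if items else None
                    arg["done"] = True
        self._ops.clear()

q = BatchingQueue()
q.enqueue(1)
batch = q.new_batch()
batch.future_enqueue(2)
first = batch.future_dequeue()
batch.future_enqueue(3)
second = batch.future_dequeue()
batch.execute()
print(first["value"], second["value"], q.dequeue())  # 1 2 3
```

The benefit visible even in this simplified form is that the whole batch costs one synchronization on the shared structure instead of one per operation.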
High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-03-24 DOI: 10.1145/3512756
Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes
The Radial Basis Function (RBF) technique is an interpolation method that produces high-quality unstructured adaptive meshes. However, the RBF-based boundary problem necessitates solving a large dense linear system with cubic arithmetic complexity that is computationally expensive and prohibitive in terms of memory footprint. In this article, we accelerate the computations of 3D unstructured mesh deformation based on RBF interpolations by exploiting the rank structured property of the matrix operator. The main idea consists in approximating the matrix off-diagonal tiles up to an application-dependent accuracy threshold. We highlight the robustness of our multiscale solver by assessing its numerical accuracy using realistic 3D geometries. In particular, we model the 3D mesh deformation on a population of the novel coronaviruses. We report and compare performance results on various parallel systems against existing state-of-the-art matrix solvers.
Journal: ACM Transactions on Parallel Computing. Volume: 9 1. Pages: 1-23. Publication type: Journal Article.
Citations: 1
Efficient Distributed Matrix-free Multigrid Methods on Locally Refined Meshes for FEM Computations
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-03-23 DOI: 10.1145/3580314
Peter Munch, T. Heister, Laura Prieto Saavedra, M. Kronbichler
This work studies three multigrid variants for matrix-free finite-element computations on locally refined meshes: geometric local smoothing, geometric global coarsening (both h-multigrid), and polynomial global coarsening (a variant of p-multigrid). We have integrated the algorithms into the same framework, the open-source finite-element library deal.II, which allows us to make fair comparisons regarding their implementation complexity, computational efficiency, and parallel scalability as well as to compare the measurements with theoretically derived performance metrics. Serial simulations and parallel weak and strong scaling on up to 147,456 CPU cores on 3,072 compute nodes are presented. The results obtained indicate that global-coarsening algorithms show a better parallel behavior for comparable smoothers due to the better load balance, particularly on the expensive fine levels. In the serial case, the costs of applying hanging-node constraints might be significant, leading to advantages of local smoothing, even though the number of solver iterations needed is slightly higher. When using p- and h-multigrid in sequence (hp-multigrid), the results indicate that it makes sense to decrease the degree of the elements first from a performance point of view due to the cheaper transfer.
Journal: ACM Transactions on Parallel Computing. Volume: 10 1. Pages: 1-38. Publication type: Journal Article.
Citations: 8
Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2022-03-11 DOI: 10.1145/3580003
S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson
Krylov methods are a key way of solving large sparse linear systems of equations but suffer from poor strong scalability on distributed memory machines. This is due to high synchronization costs from large numbers of collective communication calls alongside a low computational workload. Enlarged Krylov methods address this issue by decreasing the total iterations to convergence, an artifact of splitting the initial residual and resulting in operations on block vectors. In this article, we present a performance study of an enlarged Krylov method, Enlarged Conjugate Gradients (ECG), noting the impact of block vectors on parallel performance at scale. Most notably, we observe the increased overhead of point-to-point communication as a result of denser messages in the sparse matrix-block vector multiplication kernel. Additionally, we present models to analyze expected performance of ECG, as well as motivate design decisions. Most importantly, we introduce a new point-to-point communication approach based on node-aware communication techniques that increases efficiency of the method at scale.
Journal: ACM Transactions on Parallel Computing. Volume: 10 1. Pages: 1-25. Publication type: Journal Article.
Citations: 5
Deterministic Constant-Amortized-RMR Abortable Mutex for CC and DSM
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2021-12-09 DOI: 10.1145/3490559
P. Jayanti, S. Jayanti
The abortable mutual exclusion problem, proposed by Scott and Scherer in response to the needs in real-time systems and databases, is a variant of mutual exclusion that allows processes to abort from their attempt to acquire the lock. Worst-case constant remote memory reference algorithms for mutual exclusion using hardware instructions such as Fetch&Add or Fetch&Store have long existed for both cache coherent (CC) and distributed shared memory multiprocessors, but no such algorithms are known for abortable mutual exclusion. Even relaxing the worst-case requirement to amortized, algorithms are only known for the CC model. In this article, we improve this state of the art by designing a deterministic algorithm that uses Fetch&Store to achieve amortized O(1) remote memory reference in both the CC and distributed shared memory models. Our algorithm supports Fast Abort (a process aborts within six steps of receiving the abort signal) and has the following additional desirable properties: it supports an arbitrary number of processes of arbitrary names, requires only O(1) space per process, and satisfies a novel fairness condition that we call Airline FCFS. Our algorithm is short with fewer than a dozen lines of code.
Citations: 2
Adaptive Erasure Coded Fault Tolerant Linear System Solver
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2021-12-08 DOI: 10.1145/3490557
X. Kang, D. Gleich, A. Sameh, A. Grama
As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.
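The general idea of erasure coded computation mentioned above (augment the problem with redundant data, then recover the true solution cheaply after a fault) can be illustrated on a tiny system. The sketch below is not the adaptive method of the article, which augments only when faults are detected and keeps the system at its original size. It assumes one common static form of augmentation: an encoding matrix E = [1, 1]^T, an augmented matrix [[A, AE], [E^T A, E^T A E]], and an augmented right-hand side [b, E^T b]. Any solution (y, z) of that system gives x = y + E z. All function names here are hypothetical.

```python
from fractions import Fraction as F


def solve2(m, r):
    """Solve a 2x2 system m @ u = r exactly with Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [F(r[0] * m[1][1] - m[0][1] * r[1], det),
            F(m[0][0] * r[1] - r[0] * m[1][0], det)]


def augment(a, b):
    """Build the 3x3 augmented system for the assumed encoding E = [1, 1]^T."""
    ae = [a[0][0] + a[0][1], a[1][0] + a[1][1]]    # A E (row sums)
    eta = [a[0][0] + a[1][0], a[0][1] + a[1][1]]   # E^T A (column sums)
    etae = ae[0] + ae[1]                           # E^T A E
    a_aug = [a[0] + [ae[0]], a[1] + [ae[1]], eta + [etae]]
    b_aug = b + [b[0] + b[1]]                      # [b, E^T b]
    return a_aug, b_aug


def solve_with_erasure(a, b, lost):
    """Recover x when component `lost` (0 or 1) of y is erased.

    The erased unknown is pinned to 0 and its row and column are dropped;
    the redundant row and column take their place.
    """
    a_aug, b_aug = augment(a, b)
    keep = [i for i in range(3) if i != lost]
    sub = [[a_aug[i][j] for j in keep] for i in keep]
    rhs = [b_aug[i] for i in keep]
    sol = dict(zip(keep, solve2(sub, rhs)))
    y = [sol.get(0, F(0)), sol.get(1, F(0))]
    z = sol[2]
    return [y[0] + z, y[1] + z]                    # x = y + E z


if __name__ == "__main__":
    a = [[4, 1], [1, 3]]
    b = [1, 2]
    print(solve2(a, b))                   # solution of the original system
    print(solve_with_erasure(a, b, 0))    # same solution despite the erasure
```

For this 2x2 case the reduced matrix has the same determinant as A, so the recovery step is well defined whenever the original system is.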
Citations: 0