Parallel Computing: Latest Articles


Uphill resampling for particle filter and its implementation on graphics processing unit

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2022.102994

Özcan Dülger , Halit Oğuztüzün , Mübeccel Demirekler

We introduce a new resampling method, named Uphill, that is free from numerical instability and suitable for parallel implementation on a graphics processing unit (GPU). Common resampling algorithms such as Systematic suffer from numerical instability when single-precision floating-point numbers are used. This is due to the cumulative summation over the particle weights when the weights differ widely or the number of particles is large. The Metropolis and Rejection resampling algorithms do not suffer from numerical instability, as they only calculate pairwise ratios of weights rather than performing collective operations over the weights. They are therefore more suitable for a GPU implementation of the particle filter. However, they exhibit non-coalesced global memory access patterns, which cause their speed to deteriorate rapidly as the number of particles grows. Uphill also does not suffer from numerical instability, but it experiences the same non-coalesced global memory access problem as Metropolis and Rejection. We introduce a faster version, named Uphill-Fast, which eliminates this problem. We compare Uphill and Uphill-Fast with the Systematic, Metropolis, Metropolis-C2 and Rejection resampling methods with respect to quality and speed. We also compare them on a highly non-linear system. When the number of particles is very large, Uphill-Fast runs faster than Metropolis and Rejection while attaining similar quality in terms of RMSE. Uphill-Fast runs at roughly the same speed as Metropolis-C2, with better variance and MSE, when the number of particles is very large.
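To illustrate why ratio-based resampling avoids a cumulative sum, the following is a minimal sequential sketch of the general Metropolis resampling idea mentioned in the abstract. It is not the Uphill or Uphill-Fast method, whose details the abstract does not give. The function name `metropolis_resample` and its parameters (`num_iterations`, `seed`) are hypothetical, and a GPU version would run one chain per thread rather than a loop.

```python
import random


def metropolis_resample(weights, num_iterations, seed=0):
    """Pick one ancestor index per particle using only pairwise weight ratios.

    Illustrative sketch only: weights may be unnormalized, and no cumulative
    sum over all weights is ever formed.
    """
    rng = random.Random(seed)
    n = len(weights)
    ancestors = []
    for i in range(n):
        k = i  # each chain starts at its own particle
        for _ in range(num_iterations):
            u = rng.random()      # uniform draw in [0, 1)
            j = rng.randrange(n)  # uniformly proposed candidate particle
            # Accept j with probability min(1, w[j] / w[k]); the comparison
            # is written as a product to avoid dividing by a zero weight.
            if u * weights[k] <= weights[j]:
                k = j
        ancestors.append(k)
    return ancestors
```

Each chain reads weights at randomly proposed indices, which is the kind of scattered access pattern the abstract describes as non-coalesced on a GPU.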

Volume 115, Article 102994

Citations: 0

ParVoro++: A scalable parallel algorithm for constructing 3D Voronoi tessellations based on kd-tree decomposition

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102995

Guoqing Wu, Hongyun Tian, Guo Lu, Wei Wang

The Voronoi tessellation is a fundamental geometric data structure with numerous applications in scientific and technological fields. For large particle datasets, Voronoi tessellations must be computed in parallel on a distributed-memory supercomputer in order to satisfy time and memory-size constraints. However, load balancing and communication make the parallelization of the Voronoi tessellation challenging. In this paper, we present a scalable parallel algorithm for constructing 3D Voronoi tessellations, which evenly distributes the input particles among blocks through kd-tree decomposition. In order to construct the correct global Voronoi topology, we investigate both parametric and non-parametric methods for particle communication among the blocks of a spatial decomposition. The algorithm is implemented with process-level and thread-level parallelization and can be used across a diverse architectural landscape. Using datasets containing up to 330 million particles, we show that our algorithm achieves parallel efficiency of up to 57% using 4096 cores on a distributed-memory computer. Moreover, we compare our algorithm with previous attempts to parallelize Voronoi tessellations and show encouraging improvements in computation time.
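The idea of distributing particles evenly among blocks through kd-tree decomposition can be sketched with recursive median splits. This is a generic, sequential illustration and not the ParVoro++ implementation. The function name `kd_decompose` and the representation of points as tuples are assumptions made for the example.

```python
def kd_decompose(points, num_blocks, depth=0):
    """Split points into num_blocks blocks by recursive median cuts.

    Illustrative sketch: the split axis cycles with depth, and each cut is
    placed at the median so both halves hold (nearly) equal point counts.
    """
    if num_blocks <= 1 or len(points) <= 1:
        return [points]
    axis = depth % len(points[0])  # cycle through x, y, z, ...
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2        # median position along this axis
    left_blocks = num_blocks // 2
    right_blocks = num_blocks - left_blocks
    return (kd_decompose(ordered[:mid], left_blocks, depth + 1)
            + kd_decompose(ordered[mid:], right_blocks, depth + 1))
```

With a power-of-two block count and a point count divisible by it, every block receives the same number of points regardless of how the points are clustered in space, which is the load-balancing property the abstract relies on.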

Volume 115, Article 102995

Citations: 0

Accelerating the scheduling of the network resources of the next-generation optical data centers

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2022.102993

G. Patronas, N. Vlassopoulos, Ph. Bellos, D. Reisis

Data centers (DCs) play a key role in evolving IT applications, and they rely heavily on optical interconnects to improve their performance and scalability. Optically switched DCs most often exploit slotted Time Division Multiplexing Access (TDMA) operation and Wavelength Division Multiplexing (WDM) technology, and they rely on effective scheduling of the TDMA frames to decide in real time the end-to-end connections, which include the network links, switches and ports. This task becomes computationally intensive as the number of communication requests increases.

The current paper builds on a greedy scheduling algorithm to introduce a parallel technique that accelerates the scheduling process and improves the optical DC's performance. The proposed technique efficiently handles the scheduler's data structures, minimizes the communication among the scheduler's processors, and is scalable. Moreover, this work presents the technique's performance results for a variety of scheduling scenarios and DC sizes, executed on an algorithm-specific Single Instruction Multiple Data (SIMD) accelerator architecture and on a Graphics Processing Unit (GPU). The performance results of the GPU and of the SIMD accelerator implemented on an FPGA validate the parallel scheduler technique.

Volume 115, Article 102993

Citations: 1

Multi-level parallel multi-layer block reproducible summation algorithm

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102996

Kuan Li , Kang He , Stef Graillat , Hao Jiang , Tongxiang Gu , Jie Liu

Reproducibility means getting bitwise identical floating-point results from multiple runs of the same program, which plays an essential role in debugging and correctness checking in many codes (Villa et al., 2009). However, in parallel computing environments, the combination of dynamic scheduling of parallel computing resources and floating-point nonassociativity leads to non-reproducible results. Demmel and Nguyen proposed a floating-point summation algorithm that is reproducible independent of the order of summation (Demmel and Nguyen, 2013; 2015) and accurate by using the 1-Reduction technique. Our work combines their approach with the multi-layer block technology proposed by Castaldo et al. (2009) to design the multi-level parallel multi-layer block reproducible summation algorithm (MLP_rsum), which uses SIMD, OpenMP, and MPI based on each layer of blocks and attains reproducible results of the expected accuracy with high performance. Numerical experiments show that our algorithm is more efficient than the reproducible summation function in ReproBLAS (2018). With SIMD optimization, our algorithm is 2.41, 2.85, and 3.44 times faster than ReproBLAS on the three ARM platforms. With OpenMP optimization, our algorithm obtains linear speedup, showing that our method applies to multi-core processors. Finally, with reproducible MPI reduction, our algorithm's parallel efficiency is 76% at 32 nodes with 4 threads and 32 processes.
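The effect of floating-point nonassociativity on summation order can be shown in a few lines. The sketch below contrasts a naive left-to-right sum with Python's `math.fsum`, which returns the correctly rounded value of the exact sum and is therefore independent of summation order. `math.fsum` is used here only as a convenient stand-in; it is not the Demmel and Nguyen algorithm, nor MLP_rsum. The function names `naive_sum` and `order_independent_sum` are hypothetical.

```python
import math


def naive_sum(values):
    """Left-to-right double-precision sum; the result depends on the order."""
    total = 0.0
    for v in values:
        total += v
    return total


def order_independent_sum(values):
    """Correctly rounded sum of the inputs, identical for any ordering."""
    return math.fsum(values)
```

For example, the lists `[1e16, 1.0, -1e16, 1.0]` and `[1.0, 1.0, 1e16, -1e16]` hold the same numbers, yet `naive_sum` returns 1.0 for the first and 2.0 for the second, because adding 1.0 to 1e16 is rounded away. `order_independent_sum` returns 2.0 for both orderings.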

Volume 115, Article 102996

Citations: 1

Spatial-aware data partition for distributed memory parallelization of ANN search in multimedia retrieval

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2022.102992

Guilherme Andrade, Renato Ferreira, George Teodoro

Content-based multimedia retrieval (CBMR) applications are becoming very popular in online services that handle large volumes of data and are subject to high query rates. While these applications may be complex, finding the nearest neighboring objects (multimedia descriptors) is typically their most time-consuming operation. To address this problem, several recent works have proposed distributed memory parallelization of approximate nearest neighbors (ANN) search. These solutions employ a variety of ANN algorithms and different parallelization strategies. In this paper, we identify the currently used parallelization strategies (Data Equal Split (DES) and Bucket Equal Split (BES)) and systematically evaluate their performance. We have also developed a framework to simplify the deployment of ANN algorithms on distributed memory machines with customized parallelization or data partition strategies. We further propose a novel class of data partition/parallelization strategies that takes the spatial proximity of the data into account. Our approaches (SABES and SABES++) improve data locality and system efficiency compared to DES and BES. For instance, SABES++ achieved speedups of 4.2$\times$ and 1.8$\times$ over DES and BES, respectively, in the baseline case (40 nodes). Further, SABES and SABES++ also attained higher multi-node scalability, and the gains over DES and BES increase with the number of nodes. SABES++ is 14.5$\times$ faster than DES when 160 nodes are used.

Volume 115, Article 102992

Citations: 3

Efficient parallel reduction of bandwidth for symmetric matrices

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102998

Valeriy Manin, Bruno Lang

Bandwidth reduction can be a first step in the computation of eigenvalues and eigenvectors for a wide-banded complex Hermitian (or real symmetric) matrix. We present algorithms for this reduction and the corresponding back-transformation of the eigenvectors. These algorithms rely on blocked Householder transformations, thus enabling level 3 BLAS performance, and they feature two levels of parallelism. The efficiency of our approach is demonstrated with numerical experiments.

Citations: 0

Reviewer acknowledgment

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/S0167-8191(23)00010-8

Citations: 0

Heterogeneous sparse matrix–vector multiplication via compressed sparse row format

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-02-01 DOI: 10.1016/j.parco.2023.102997

Phillip Allen Lane, Joshua Dennis Booth

Sparse matrix–vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from poor performance on many devices. Because of this, SpMV normally requires special care in how the matrix is stored and how the kernel is tuned for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs. Therefore, an emerging goal has been to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to be executed on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-$k$, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 838 and AMD Epyc 7742 CPUs while still outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 for regular sparse matrices, i.e., sparse matrices where the number of nonzeros per row has a variance $\leq$ 10, such as those commonly generated from two- and three-dimensional finite difference and element problems. In particular, CSR-$k$ achieves this with reordering and by grouping rows into a hierarchical structure of super-rows and super–super-rows that are represented by just a few extra arrays of pointers. Due to its simplicity, a model can be tuned for a device, and this model can be used to select super-row and super–super-row sizes in constant time.
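For readers unfamiliar with the baseline format, here is a minimal sketch of SpMV over standard compressed sparse row (CSR) storage. It shows plain CSR only; the super-row and super–super-row grouping of CSR-$k$ is not reproduced here, since the abstract gives no layout details. The function name `csr_spmv` and the array names `row_ptr`, `col_idx` and `values` follow common convention but are assumptions for this example.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in standard CSR form.

    row_ptr[r] .. row_ptr[r + 1] delimits the nonzeros of row r inside
    col_idx (column indices) and values (nonzero entries).
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through col_idx, so accesses to x are irregular;
            # this indirection is a common reason SpMV is hard to tune.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y
```

As a usage example, the 3×3 matrix with rows (10, 0, 0), (0, 20, 30) and (40, 0, 50) is stored as `row_ptr = [0, 1, 3, 5]`, `col_idx = [0, 1, 2, 0, 2]`, `values = [10, 20, 30, 40, 50]`. Multiplying by `x = [1, 2, 3]` gives `[10, 130, 190]`.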

Volume 115, Article 102997

Citations: 0

Efficient parallel reduction of bandwidth for symmetric matrices

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2023-01-01 DOI: 10.2139/ssrn.4050432

Valeriy Manin, B. Lang

Citations: 0

Efficient parallel branch-and-bound approaches for exact graph edit distance problem

IF 1.4 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing

Pub Date : 2022-12-01 DOI: 10.1016/j.parco.2022.102984

Adel Dabah , Ibrahim Chegrane , Saïd Yahiaoui , Ahcene Bendjoudi , Nadia Nouali-Taboudjemat

Graph Edit Distance (GED) is a well-known measure used in graph matching to quantify the similarity or dissimilarity between two graphs by computing the minimum cost of the edit operations needed to transform one graph into another. This process, which appears to be simple, is known to be NP-hard and time consuming, since the search space grows exponentially. One way to solve this problem optimally is to use Branch and Bound (B&B) algorithms, which reduce the computation time required to explore the whole search space by performing an implicit enumeration of the search space, based on a pruning technique, instead of an exhaustive one. Nevertheless, they remain inefficient when dealing with large problem instances, due to the impractical running time needed to explore the whole search space. To overcome this issue, we propose in this paper three parallel B&B approaches based on shared memory to exploit multi-core CPU processors. First, a work-stealing approach, where several instances of the B&B algorithm explore a single search tree concurrently, achieving speedups of up to 24$\times$ over the sequential version. Second, a tree-based approach, where multiple parts of the search tree are explored simultaneously by independent B&B instances, achieving speedups of up to 28$\times$. Finally, due to the irregular nature of the GED problem, two load-balancing strategies are proposed to ensure a fair workload between parallel processes, achieving impressive speedups of up to 300$\times$. All experiments have been carried out on well-known datasets.
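The pruning idea behind Branch and Bound can be sketched on a much simpler stand-in problem: finding a minimum-cost one-to-one assignment of rows to columns. This is not the GED formulation and not any of the paper's parallel approaches. It is a sequential illustration of implicit enumeration, and the function name `branch_and_bound_assignment` and the lower bound used are assumptions chosen for the example.

```python
def branch_and_bound_assignment(cost):
    """Return (best_cost, assignment) for a square cost matrix.

    Depth-first B&B: row r is mapped to an unused column, and a branch is
    pruned when its optimistic lower bound cannot beat the best solution.
    """
    n = len(cost)
    # Optimistic bound: every remaining row takes its cheapest entry.
    suffix_bound = [0.0] * (n + 1)
    for r in range(n - 1, -1, -1):
        suffix_bound[r] = suffix_bound[r + 1] + min(cost[r])

    best = {"cost": float("inf"), "assignment": None}
    used = [False] * n
    current = []

    def explore(row, partial_cost):
        if partial_cost + suffix_bound[row] >= best["cost"]:
            return  # prune: this subtree cannot improve on the incumbent
        if row == n:
            best["cost"] = partial_cost
            best["assignment"] = list(current)
            return
        for col in range(n):
            if not used[col]:
                used[col] = True
                current.append(col)
                explore(row + 1, partial_cost + cost[row][col])
                current.pop()
                used[col] = False

    explore(0, 0.0)
    return best["cost"], best["assignment"]
```

The subtrees rooted at different column choices are independent apart from the shared incumbent cost, which is what makes tree-based and work-stealing parallelizations like those described in the abstract possible. The amount of pruning differs between subtrees, which is one source of the load imbalance the abstract mentions.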

Volume 114, Article 102984

Citations: 2
