Optimizing Multi-grid Computation and Parallelization on Multi-cores
Xiaojian Yang, Shengguo Li, Fan Yuan, Dezun Dong, Chun Huang, Z. Wang
DOI: https://doi.org/10.1145/3577193.3593726
Multigrid (MG) algorithms are widely used to solve large-scale sparse linear systems, a task essential to many high-performance workloads. The symmetric Gauss-Seidel (SYMGS) method is often the performance bottleneck of MG. This paper presents new methods to improve the computation and parallelization efficiency of the SYMGS and MG algorithms on multi-core CPUs. Our solution employs a matrix splitting strategy and a revised computation formula to reduce the arithmetic operations and memory accesses in SYMGS. With this new SYMGS strategy, we can then merge the two most time-consuming components of MG. On top of these, we propose a new asynchronous parallelization scheme to reduce the synchronization overhead when parallelizing SYMGS. We demonstrate the benefit of our techniques by integrating them into the HPCG benchmark and two real-life applications. Evaluation on four architectures, three ARMv8 and one x86, shows that our techniques greatly surpass the performance of engineer- and vendor-tuned implementations across various workloads and platforms.
Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph
Shaofeng Yang, Xiandong Liu, Yun-Tsz Wang, Xin He, Guangming Tan
DOI: https://doi.org/10.1145/3577193.3593728
Finding the All-Pairs Shortest Paths (APSP) in a graph is key to various domains. Motivated by the observation that graphs are sparse in most real-world applications, we store the whole graph in a compressed storage format in each process of the distributed computing cluster and combine the Floyd algorithm with the Dijkstra algorithm to solve the APSP problem, which leads to the novel Fast APSP algorithm. In contrast to the state-of-the-art Part APSP algorithm, our algorithm adds some memory overhead to store the original sparse graph and uses the local Floyd and global Dijkstra algorithms simultaneously. The payoff is the circumvention of expensive global communication, the elimination of one local FW operation, a simplified Minplus function, and contiguous data accesses. Furthermore, we propose a parallel framework to address the mismatch between the number of GPUs and the number of divisible blocks of a graph. The Fast APSP algorithm exhibits an average speedup of 16.97x over the CPU Dijkstra algorithm, 7.09x over the GPU Dijkstra algorithm, 7.09x over the Part APSP algorithm, and 4.6x over the decentralized Part APSP algorithm. It also shows good scalability in our experiments. It takes about 12.45 minutes to solve the APSP problem for a graph with 11,548,845 vertices using 2,048 GPUs.
FLASH: FPGA-Accelerated Smart Switches with GCN Case Study
Pouya Haghi, William Krska, Cheng Tan, Tong Geng, P. Chen, Connor Greenwood, Anqi Guo, Thomas M. Hines, Chunshu Wu, Ang Li, A. Skjellum, Martin C. Herbordt
DOI: https://doi.org/10.1145/3577193.3593739
Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor programmed in high-level languages, e.g., P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed: more complex calculations, such as sparse computation and fused multiply-accumulate; data-intensive floating-point operations; data reuse; and significant memory. The problem addressed here is that such a switch augmentation must support a large amount of state, significant and flexible compute capability, and ease of programming, all while maintaining full functionality, ensuring high throughput, and demonstrating utility. In this work, we propose a programmable look-aside accelerator that can be embedded into, or attached to, existing communication switch pipelines and is capable of processing packets at line rate. The proposed in-switch accelerator mixes an ISA (a subset of RISC-V instructions) with dataflow graphs (as found in CGRAs). To augment performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain that compiles user-provided C/C++ code into back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.
PAC: Preference-Aware Co-location Scheduling on Heterogeneous NUMA Architectures To Improve Resource Utilization
Pu Pang, Yaoxuan Li, Bo Liu, Quan Chen, Zhou Yu, Zhibin Yu, Deze Zeng, Jingwen Leng, Jieru Zhao, Minyi Guo
DOI: https://doi.org/10.1145/3577193.3593709
Latency-critical applications directly interact with end users and often exhibit a diurnal load pattern. In production, best-effort applications are often co-located with them to utilize the idle cores at low load. Meanwhile, modern computers are evolving towards heterogeneous NUMA architectures, where cores differ in computation ability, memory access latency, and network communication delay. Prior co-location scheduling work did not consider the NUMA architecture and failed to maximize the throughput of best-effort applications while ensuring the required QoS of latency-critical applications. Our investigation shows that the NUMA effect has complex impacts on the latency of latency-critical applications and the throughput of best-effort applications. We therefore propose PAC, a preference-aware co-location scheduling scheme that accounts for the NUMA effect on heterogeneous NUMA architectures. PAC comprises a performance monitor and a core scheduler. Specifically, the performance monitor identifies the "dangerous" latency-critical applications that require upgraded core allocations, and we propose two low-overhead scheduling strategies for the scheduler that identify application bottlenecks and adjust core allocations accordingly. Experimental results show that PAC improves the throughput of best-effort applications by 3.87× while ensuring the required QoS of latency-critical applications.
Transfer-learning-based Autotuning using Gaussian Copula
Thomas Randall, Jaehoon Koo, B. Videau, Michael Kruse, Xingfu Wu, P. Hovland, Mary Hall, Rong Ge, P. Balaprakash
DOI: https://doi.org/10.1145/3577193.3593712
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39× speedup, a dramatic improvement over the 20.58× speedup using prior techniques.
OpenFFT: An Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs
Tun Chen, Haipeng Jia, Yunquan Zhang, Kun Li, Zhihao Li, Xiang Zhao, Jianyu Yao, Chendi Li
DOI: https://doi.org/10.1145/3577193.3593735
The sophisticated hierarchy and sharing characteristics of caches in multicore CPU architectures pose challenges to the performance of fundamental algorithms, especially the implementation and optimization of the 3D FFT. The 3D FFT is a memory-bound algorithm that involves many highly scattered memory accesses. As the working set scales, data locality becomes poor, which is prone to cause serious memory access overhead, especially for high-dimensional data transposition. This paper proposes a 3D FFT optimization framework named OpenFFT. The framework optimizes the memory accesses of the 3D FFT through the following methods: 1) a novel tiling algorithm, Z-OpenFFT, based on the column-order algorithm for high-dimensional vectorization, to improve data locality and eliminate transposition; 2) an efficient section-cache-aware search algorithm to optimize the memory accesses of the 1D FFT butterfly network; and 3) a multi-thread allocation model that analyzes the cache hierarchy and task size to allocate threads adaptively. Experiments demonstrate that OpenFFT obtains more competitive performance than the best configurations of FFTW and ARMPL on ARM CPUs.
Scalable algorithms for compact spanners on real world graphs
Maulein Pathak, Yogish Sabharwal, Neelima Gupta
DOI: https://doi.org/10.1145/3577193.3593727
A graph spanner is a subgraph that preserves the shortest distance between every pair of vertices within a permissible distortion. Typically, the allowed distortion is a multiplicative factor (of the original distances) and is referred to as stretch. Efficient multiplicative spanners, based on finding low-diameter decompositions, have been studied in the distributed and parallel settings. Most of these studies aim to find spanners with theoretical guarantees on the stretch and spanner size. The spanner size guarantees obtained in these works are not very useful for real-world sparse graphs. In this work, we evaluate and compare the state-of-the-art algorithms for multiplicative spanners on real-world and synthetic graphs. We propose a heuristic that aims to reduce the size of the output spanner. When combined with existing approaches, it admits similar theoretical guarantees to those described in prior work while yielding considerably smaller spanners. Our heuristic builds on the idea of selecting centers with large neighborhoods and growing clusters around them, and we present a parallel algorithm for selecting a large set of cluster centers based on this heuristic. We evaluate our algorithms on 18 real-world graphs from the SNAP dataset and 3 well-studied synthetic graphs. We demonstrate that our heuristic yields spanners with significantly fewer edges, up to 6x smaller on real-world graphs and up to 20x smaller on synthetic graphs, compared to baselines from prior work.
Using Additive Modifications in LU Factorization Instead of Pivoting
Neil Lindquist, P. Luszczek, J. Dongarra
DOI: https://doi.org/10.1145/3577193.3593731
Direct solvers for dense systems of linear equations commonly use partial pivoting to ensure numerical stability. However, pivoting can introduce significant performance overheads, such as synchronization and data movement, particularly on distributed systems. To improve the performance of these solvers, we present an alternative to pivoting in which numerical stability is obtained through additive updates. We implemented this approach using SLATE, a GPU-accelerated numerical linear algebra library, and evaluated it on the Summit supercomputer. Our approach provides better performance (up to 5-fold speedup) than Gaussian elimination with partial pivoting for comparable accuracy on most of the tested matrices. It also provides better accuracy (up to 15 more digits) than Gaussian elimination with no pivoting for comparable performance.
FT-topo: Architecture-Driven Folded-Triangle Partitioning for Communication-efficient Graph Processing
X. Gan, Guang Wu, Ruigeng Zeng, Jiaqi Si, Ji Liu, Daxiang Dong, Chunye Gong, Cong Liu, Tiejun Li
DOI: https://doi.org/10.1145/3577193.3593729
As graph sizes (numbers of vertices and edges) increase from billions to trillions, efficient graph processing requires exascale computing clusters, which consist of hundreds of thousands of nodes connected via hierarchical networks with multiple levels of communication domains, e.g., multilevel triangle communication domains. While the computation of traversal-centric graph algorithms is relatively simple (e.g., status checks), communication is the bottleneck due to the transfer of numerous small messages among hierarchical triangle communication domains. In this paper, we propose FT-topo, a communication-efficient graph partitioning policy for processing exascale graphs. The key idea of FT-topo is to directly map the big graph onto the hierarchical topology of exascale clusters. We carry out extensive experiments, running various graph algorithms on synthetic and real-world graphs on both the Tianhe supercomputer and commercial clusters, to show the advantages of FT-topo. FT-topo substantially mitigates communication overhead and is thus orders of magnitude faster than state-of-the-art methods. In particular, the FT-topo-based Tianhe supercomputer outperforms the fastest BFS and SSSP systems in the latest Graph500 lists. Furthermore, we deployed FT-topo on other large-scale commercial clusters, where it also greatly improves graph processing performance. FT-topo-based graph operators outperform the state-of-the-art graph partitioning schemes and graph systems by orders of magnitude on real-world graphs.
Use Only What You Need: Judicious Parallelism For File Transfers in High Performance Networks
Md. Arifuzzaman, Engin Arslan
DOI: https://doi.org/10.1145/3577193.3593722
Parallelism is key to efficiently utilizing high-speed research networks when transferring large volumes of data. However, the monolithic design of existing transfer applications forces the same level of parallelism to be used for the read, write, and network operations of a file transfer. This, in turn, overburdens system resources, since setting the parallelism level for the slowest component results in unnecessarily high parallelism for the other components. Using more parallelism than necessary leads to increased overhead on system resources and unfair resource allocation among competing transfers. In this paper, we introduce Marlin, a modular file transfer architecture that separates I/O and network operations so that parallelism can be adjusted independently for each component. Marlin adopts an online gradient descent algorithm to swiftly search the solution space and find the optimal level of parallelism for read, transfer, and write operations. Experimental results collected under various network settings show that Marlin can identify and use the minimum parallelism level for each component, improving fairness among competing transfers and CPU utilization. Finally, separating network transfers from write operations allows Marlin to outperform state-of-the-art solutions by more than 2x when transferring small datasets.