Optimizing Multi-grid Computation and Parallelization on Multi-cores
Xiaojian Yang, Shengguo Li, Fan Yuan, Dezun Dong, Chun Huang, Z. Wang
DOI: https://doi.org/10.1145/3577193.3593726
Multigrid (MG) algorithms are widely used to solve large-scale sparse linear systems, a task essential to many high-performance workloads. The symmetric Gauss-Seidel (SYMGS) method is often the performance bottleneck of MG. This paper presents new methods to improve the computation and parallelization efficiency of the SYMGS and MG algorithms on multi-core CPUs. Our solution employs a matrix splitting strategy and a revised computation formula to reduce the arithmetic operations and memory accesses in SYMGS. With this new SYMGS strategy, we can then merge the two most time-consuming components of MG. On top of these, we propose a new asynchronous parallelization scheme to reduce the synchronization overhead when parallelizing SYMGS. We demonstrate the benefit of our techniques by integrating them into the HPCG benchmark and two real-life applications. Evaluation on four architectures, three ARMv8 and one x86, shows that our techniques greatly surpass the performance of engineer- and vendor-tuned implementations across various workloads and platforms.
Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph
Shaofeng Yang, Xiandong Liu, Yun-Tsz Wang, Xin He, Guangming Tan
DOI: https://doi.org/10.1145/3577193.3593728
Finding the All-Pairs Shortest Paths (APSP) in a graph is key to various domains. Motivated by the observation that graphs are sparse in most real-world applications, we store the whole graph in a compressed storage format in each process of the distributed computing cluster and combine the Floyd algorithm with the Dijkstra algorithm to solve the APSP problem, which leads to the novel Fast APSP algorithm. In contrast to the state-of-the-art Part APSP algorithm, our algorithm adds some memory overhead to store the original sparse graph and uses the local Floyd and global Dijkstra algorithms simultaneously. The payoff is the circumvention of expensive global communication, the elimination of one local FW operation, a simplified Minplus function, and contiguous data accesses. Furthermore, we propose a parallel framework to address the mismatch between the number of GPUs and the number of divisible blocks of a graph. The Fast APSP algorithm exhibits an average speedup of 16.97x over the CPU Dijkstra algorithm, 7.09x over the GPU Dijkstra algorithm, 7.09x over the Part APSP algorithm, and 4.6x over the decentralized Part APSP algorithm. It also shows good scalability in our experiments. It takes about 12.45 minutes to solve the APSP problem for a graph with 11,548,845 vertices using 2,048 GPUs.
FLASH: FPGA-Accelerated Smart Switches with GCN Case Study
Pouya Haghi, William Krska, Cheng Tan, Tong Geng, P. Chen, Connor Greenwood, Anqi Guo, Thomas M. Hines, Chunshu Wu, Ang Li, A. Skjellum, Martin C. Herbordt
DOI: https://doi.org/10.1145/3577193.3593739
Some communication switches, e.g., the Mellanox SHArP and those in the IBM BlueGene clusters, are augmented to process packets at the application level with fixed-function collectives. This approach, however, lacks flexibility, which limits its applicability to diverse and dynamic workloads. Recently, a new type of programmable packet processor programmed in high-level languages, e.g., P4, has emerged as a possible candidate. P4-based switches, however, fall short in certain applications, including machine learning, where capabilities not currently supported by P4 are needed: more complex calculations, such as sparse computation and fused multiply-accumulate; data-intensive floating-point operations; data reuse; and significant memory. The problem addressed here is that such a switch augmentation must support a large amount of state, significant and flexible compute capability, and ease of programming, all while maintaining full functionality, ensuring high throughput, and demonstrating utility. In this work, we propose a programmable look-aside accelerator that can be embedded into, or attached to, existing communication switch pipelines and is capable of processing packets at line rate. The proposed in-switch accelerator mixes an ISA (a subset of RISC-V instructions) with dataflow graphs (as found in CGRAs). To augment performance, vector instructions are also supported. To facilitate usability, we have developed a complete toolchain that compiles user-provided C/C++ code into back-end instructions for configuring the accelerator. While this approach is flexible enough to support various workloads, in this paper we consider Graph Convolutional Networks (GCNs) as a case study. Experimental results show that this approach considerably improves the performance of distributed GCN applications.
PAC: Preference-Aware Co-location Scheduling on Heterogeneous NUMA Architectures To Improve Resource Utilization
Pu Pang, Yaoxuan Li, Bo Liu, Quan Chen, Zhou Yu, Zhibin Yu, Deze Zeng, Jingwen Leng, Jieru Zhao, Minyi Guo
DOI: https://doi.org/10.1145/3577193.3593709
Latency-critical applications directly interact with end users and often exhibit a diurnal load pattern. In production, best-effort applications are often co-located with them to utilize the idle cores at low load. Meanwhile, modern computers are evolving towards heterogeneous NUMA architectures, where cores differ in computation ability, memory access latency, and network communication delay. Prior co-location scheduling work did not consider the NUMA architecture and failed to maximize the throughput of best-effort applications while ensuring the required QoS of latency-critical applications. Our investigation shows that the NUMA effect has complex impacts on the latency of latency-critical applications and the throughput of best-effort applications. We therefore propose PAC, a preference-aware co-location scheduling scheme that accounts for the NUMA effect on heterogeneous NUMA architectures. PAC comprises a performance monitor and a core scheduler. Specifically, the performance monitor identifies the "dangerous" latency-critical applications that require upgraded core allocations, and we propose two low-overhead scheduling strategies for the scheduler that identify application bottlenecks and adjust core allocations accordingly. Experimental results show that PAC improves the throughput of best-effort applications by 3.87× while ensuring the required QoS of latency-critical applications.
Transfer-learning-based Autotuning using Gaussian Copula
Thomas Randall, Jaehoon Koo, B. Videau, Michael Kruse, Xingfu Wu, P. Hovland, Mary Hall, Rong Ge, P. Balaprakash
DOI: https://doi.org/10.1145/3577193.3593712
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39× speedup, a dramatic improvement over the 20.58× speedup using prior techniques.
OpenFFT: An Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs
Tun Chen, Haipeng Jia, Yunquan Zhang, Kun Li, Zhihao Li, Xiang Zhao, Jianyu Yao, Chendi Li
DOI: https://doi.org/10.1145/3577193.3593735
The sophisticated hierarchy and sharing characteristics of caches in multicore CPU architectures pose challenges to the performance of fundamental algorithms, especially the implementation and optimization of the 3D FFT. The 3D FFT is a memory-bound algorithm that involves many highly scattered memory accesses. As the working set scales, data locality becomes poor, which is prone to cause serious memory access overhead, especially for high-dimensional data transposition. This paper proposes a 3D FFT optimization framework named OpenFFT. The framework optimizes the memory accesses of the 3D FFT through the following methods: 1) a novel tiling algorithm, Z-OpenFFT, based on the column-order algorithm for high-dimensional vectorization, to improve data locality and eliminate transposition; 2) an efficient section-cache-aware search algorithm to optimize the memory accesses of the 1D FFT butterfly network; and 3) a multi-thread allocation model that analyzes the cache hierarchy and task size to allocate threads adaptively. Experiments demonstrate that OpenFFT obtains more competitive performance than the best configurations of FFTW and ARMPL on ARM CPUs.
Scalable algorithms for compact spanners on real world graphs
Maulein Pathak, Yogish Sabharwal, Neelima Gupta
DOI: https://doi.org/10.1145/3577193.3593727
A graph spanner is a subgraph that preserves the shortest distance between every pair of vertices within a permissible distortion. Typically, the allowed distortion is a multiplicative factor (of the original distances) and is referred to as stretch. Efficient multiplicative spanners, based on finding low-diameter decompositions, have been studied in the distributed and parallel settings. Most of these studies aim to find spanners with theoretical guarantees on the stretch and spanner size. The spanner size guarantees obtained in these works are not very useful for real-world sparse graphs. In this work, we evaluate and compare the state-of-the-art algorithms for multiplicative spanners on real-world and synthetic graphs. We propose a heuristic that aims to reduce the size of the output spanner. When combined with existing approaches, it admits similar theoretical guarantees to those described in prior work while yielding considerably smaller spanners. Our heuristic builds on the idea of selecting centers with large neighborhoods and growing clusters around them, and we present a parallel algorithm for selecting a large set of cluster centers based on this heuristic. We evaluate our algorithms on 18 real-world graphs from the SNAP dataset and 3 well-studied synthetic graphs. We demonstrate that our heuristic yields spanners with significantly fewer edges, up to 6x smaller on real-world graphs and up to 20x smaller on synthetic graphs, compared to baselines from prior work.
Using Additive Modifications in LU Factorization Instead of Pivoting
Neil Lindquist, P. Luszczek, J. Dongarra
DOI: https://doi.org/10.1145/3577193.3593731
Direct solvers for dense systems of linear equations commonly use partial pivoting to ensure numerical stability. However, pivoting can introduce significant performance overheads, such as synchronization and data movement, particularly on distributed systems. To improve the performance of these solvers, we present an alternative to pivoting in which numerical stability is obtained through additive updates. We implemented this approach using SLATE, a GPU-accelerated numerical linear algebra library, and evaluated it on the Summit supercomputer. Our approach provides better performance (up to 5-fold speedup) than Gaussian elimination with partial pivoting for comparable accuracy on most of the tested matrices. It also provides better accuracy (up to 15 more digits) than Gaussian elimination with no pivoting for comparable performance.
FT-topo: Architecture-Driven Folded-Triangle Partitioning for Communication-efficient Graph Processing
X. Gan, Guang Wu, Ruigeng Zeng, Jiaqi Si, Ji Liu, Daxiang Dong, Chunye Gong, Cong Liu, Tiejun Li
DOI: https://doi.org/10.1145/3577193.3593729
As graph sizes (numbers of vertices and edges) increase from billions to trillions, efficient graph processing requires exascale computing clusters, which consist of hundreds of thousands of nodes connected via hierarchical networks with multiple levels of communication domains, e.g., multilevel triangle communication domains. While the computation of traversal-centric graph algorithms is relatively simple (e.g., status checks), communication is the bottleneck due to the transfer of numerous small messages among hierarchical triangle communication domains. In this paper, we propose FT-topo, a communication-efficient graph partitioning policy for processing exascale graphs. The key idea of FT-topo is to directly map the big graph onto the hierarchical topology of exascale clusters. We carry out extensive experiments, running various graph algorithms on synthetic and real-world graphs on both the Tianhe supercomputer and commercial clusters, to show the advantages of FT-topo. FT-topo substantially mitigates communication overhead and is thus orders of magnitude faster than state-of-the-art methods. In particular, the FT-topo-based Tianhe supercomputer outperforms the fastest BFS and SSSP systems in the latest Graph500 lists. Furthermore, we deployed FT-topo on other large-scale commercial clusters, where it also greatly improves graph processing performance. FT-topo-based graph operators outperform the state-of-the-art graph partitioning schemes and graph systems by orders of magnitude on real-world graphs.
Use Only What You Need: Judicious Parallelism For File Transfers in High Performance Networks
Md. Arifuzzaman, Engin Arslan
DOI: https://doi.org/10.1145/3577193.3593722
Parallelism is key to efficiently utilizing high-speed research networks when transferring large volumes of data. However, the monolithic design of existing transfer applications forces the same level of parallelism to be used for the read, write, and network operations of a file transfer. This, in turn, overburdens system resources, since setting the parallelism level for the slowest component results in unnecessarily high parallelism for the other components. Using more parallelism than necessary leads to increased overhead on system resources and unfair resource allocation among competing transfers. In this paper, we introduce Marlin, a modular file transfer architecture that separates I/O and network operations so that parallelism can be adjusted independently for each component. Marlin adopts an online gradient descent algorithm to swiftly search the solution space and find the optimal level of parallelism for read, transfer, and write operations. Experimental results collected under various network settings show that Marlin can identify and use the minimum parallelism level for each component, improving fairness among competing transfers and CPU utilization. Finally, separating network transfers from write operations allows Marlin to outperform state-of-the-art solutions by more than 2x when transferring small datasets.