Progressive Optimization of Batched LU Factorization on GPUs
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916270
A. Abdelfattah, S. Tomov, J. Dongarra
This paper presents a progressive approach for optimizing batched LU factorization on graphics processing units (GPUs). The paper shows that relying on level-3 BLAS routines for performance does not really pay off, and that it is important to pay attention to the memory-bound part of the algorithm, especially when the problem size is very small. In this context, we develop a size-aware multi-level blocking technique that uses different granularities for kernel fusion according to the problem size. Our experiments, conducted on a Tesla V100 GPU, show that the multi-level blocking technique achieves speedups of up to 3.28×/2.69× in single/double precision over the generic LAPACK-style implementation. It is also up to 8.72×/7.2× faster than the cuBLAS library in single and double precision, respectively. The developed solution is integrated into the open-source MAGMA library.
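The sketch below (plain Python/NumPy, not the MAGMA GPU code, and only an assumption about what a LAPACK-style blocked LU looks like in its simplest form) shows a right-looking blocked LU without pivoting. It illustrates the trade-off the abstract refers to: the block size nb controls how much work lands in the GEMM-like trailing update (the level-3 BLAS part), while the panel factorization remains memory-bound and dominates for very small matrices.

```python
# Illustrative right-looking blocked LU without pivoting (not the paper's code).
import numpy as np

def blocked_lu(A, nb):
    """In-place LU of a square matrix A (L unit-lower, U upper), block size nb."""
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Panel factorization: memory-bound, dominates for very small matrices.
        for j in range(k, k + b):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+b] -= np.outer(A[j+1:, j], A[j, j+1:k+b])
        if k + b < n:
            # Row-block triangular solve, then the GEMM-like trailing update
            # (the level-3 part that blocking tries to maximize).
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            A[k:k+b, k+b:] = np.linalg.solve(L11, A[k:k+b, k+b:])
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return A

# Quick check on a diagonally dominant matrix (so pivoting is not needed).
A0 = np.random.rand(64, 64) + 64 * np.eye(64)
F = blocked_lu(A0.copy(), nb=8)
L, U = np.tril(F, -1) + np.eye(64), np.triu(F)
assert np.allclose(L @ U, A0)
```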
{"title":"Progressive Optimization of Batched LU Factorization on GPUs","authors":"A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1109/HPEC.2019.8916270","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916270","url":null,"abstract":"This paper presents a progressive approach for optimizing the batched LU factorization on graphics processing units (GPUs). The paper shows that the reliance on level-3 BLAS routines for performance does not really pay off, and that it is indeed important to pay attention to the memory-bound part of the algorithm, especially when the problem size is very small. In this context, we develop a size-aware multi-level blocking technique that utilizes different granularities for kernel fusion according to the problem size. Our experiments, which are conducted on a Tesla V100 GPU, show that the multi-level blocking technique achieves speedups for single/double precisions that are up to 3.28×/2.69× against the generic LAPACK-style implementation. It is also up to 8.72×/7.2× faster than the cuBLAS library for single and double precisions, respectively. The developed solution is integrated into the open-source MAGMA library.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1996 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127322379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Scalable Distributed Tensor Decompositions
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916319
M. Baskaran, Thomas Henretty, J. Ezick
Tensor decomposition is a prominent technique for analyzing multi-attribute data and is increasingly used for data analysis in different application areas. Tensor decomposition methods are computationally intense and often involve irregular memory accesses over large-scale sparse data. Hence it becomes critical to optimize the execution of such data-intensive computations and the associated data movement to reduce the eventual time-to-solution in data analysis applications. With the prevalence of advanced high-performance computing (HPC) systems for data analysis applications, it is becoming increasingly important to provide fast and scalable implementations of tensor decompositions and to execute them efficiently on modern and advanced HPC systems. In this paper, we present distributed tensor decomposition methods that achieve faster, memory-efficient, and communication-reduced execution on HPC systems. We demonstrate that our techniques reduce the overall communication and execution time of tensor decomposition methods when they are used to analyze datasets of varied size from real applications. We illustrate our results on an HPE Superdome Flex server, a high-end modular system offering large-scale in-memory computing, and on a distributed cluster of Intel Xeon multi-core nodes.
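As a concrete reference point for what a tensor decomposition computes, the following is a minimal dense CP-ALS loop in Python/NumPy. It only shows the serial computational core, the MTTKRP and the small Hadamard-of-Gram solves; the paper's distributed, sparse, communication-reduced implementation is far more involved, and none of the names below come from it.

```python
# Minimal dense CP-ALS for a 3-way tensor (illustration only).
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows indexed by that mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, rank, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in T.shape]
    for _ in range(iters):
        for m in range(3):
            others = [factors[i] for i in range(3) if i != m]
            # Khatri-Rao product of the other two factors ...
            kr = np.einsum('ir,jr->ijr', others[0], others[1]).reshape(-1, rank)
            # ... the MTTKRP (the expensive, memory-irregular step for sparse data) ...
            mttkrp = unfold(T, m) @ kr
            # ... and a small rank-by-rank system.
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[m] = mttkrp @ np.linalg.pinv(gram)
    return factors

# Recover an exact rank-3 tensor; the relative error is typically very small.
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((n, 3)) for n in (10, 12, 14))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
Fa, Fb, Fc = cp_als(T, rank=3)
print(np.linalg.norm(np.einsum('ir,jr,kr->ijk', Fa, Fb, Fc) - T) / np.linalg.norm(T))
```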
{"title":"Fast and Scalable Distributed Tensor Decompositions","authors":"M. Baskaran, Thomas Henretty, J. Ezick","doi":"10.1109/HPEC.2019.8916319","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916319","url":null,"abstract":"Tensor decomposition is a prominent technique for analyzing multi-attribute data and is being increasingly used for data analysis in different application areas. Tensor decomposition methods are computationally intense and often involve irregular memory accesses over large-scale sparse data. Hence it becomes critical to optimize the execution of such data intensive computations and associated data movement to reduce the eventual time-to-solution in data analysis applications. With the prevalence of using advanced high-performance computing (HPC) systems for data analysis applications, it is becoming increasingly important to provide fast and scalable implementation of tensor decompositions and execute them efficiently on modern and advanced HPC systems. In this paper, we present distributed tensor decomposition methods that achieve faster, memory-efficient, and communication-reduced execution on HPC systems. We demonstrate that our techniques reduce the overall communication and execution time of tensor decomposition methods when they are used for analyzing datasets of varied size from real application. We illustrate our results on HPE Superdome Flex server, a high-end modular system offering large-scale in-memory computing, and on a distributed cluster of Intel Xeon multi-core nodes.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128037102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the Efficiency of OpenCL Pipe for Hiding Memory Latency on Cloud FPGAs
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916236
Arnab A. Purkayastha, S. Raghavendran, Jhanani Thiagarajan, H. Tabkhi
OpenCL programmability combined with OpenCL High-Level Synthesis (OpenCL-HLS) tools has brought tremendous improvements to the reconfigurable computing field. FPGAs' inherent pipelined parallelism provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to the pipelined datapath, which hinders the benefits of datapath customization. This paper explores the efficiency of "OpenCL Pipe" to hide memory access latency on cloud FPGAs by decoupling memory access from computation. The Pipe semantic is leveraged to split OpenCL kernels into "read", "compute", and "write back" sub-kernels that work concurrently to overlap the computation of current threads with the memory accesses of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite v3.1. All our tests are conducted on the Xilinx VU9P FPGA platform of the Amazon cloud-based AWS EC2 F1 instance. On average, we observe a 5.2× speedup, with a 2.2× increase in memory bandwidth utilization and about a 2.5× increase in FPGA resource utilization, over the baseline synthesis (Xilinx OpenCL-HLS). This work has been funded and supported by the Xilinx University Program (XUP).
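The following is only a schematic analogue, written in Python rather than OpenCL, of the read/compute/write-back decoupling described above: three stages connected by bounded queues (standing in for pipes) so that fetching future work items overlaps with computing on current ones. It models the structure, not the FPGA behavior.

```python
# Producer/worker/consumer analogue of the "read" / "compute" / "write back"
# sub-kernel split (illustrative only; real OpenCL pipes are on-chip FIFOs).
import threading, queue

def read_stage(src, out_q):
    for item in src:                # stands in for global-memory loads
        out_q.put(item)
    out_q.put(None)                 # end-of-stream marker

def compute_stage(in_q, out_q):
    while (item := in_q.get()) is not None:
        out_q.put(item * item)      # stands in for the arithmetic-heavy kernel
    out_q.put(None)

def write_stage(in_q, dst):
    while (item := in_q.get()) is not None:
        dst.append(item)            # stands in for global-memory stores

src, dst = range(16), []
q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)    # the "pipes"
stages = [threading.Thread(target=f, args=a) for f, a in
          [(read_stage, (src, q1)), (compute_stage, (q1, q2)), (write_stage, (q2, dst))]]
for t in stages: t.start()
for t in stages: t.join()
print(dst)
```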
{"title":"Exploring the Efficiency of OpenCL Pipe for Hiding Memory Latency on Cloud FPGAs","authors":"Arnab A. Purkayastha, S. Raghavendran, Jhanani Thiagarajan, H. Tabkhi","doi":"10.1109/HPEC.2019.8916236","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916236","url":null,"abstract":"OpenCL programming ability combined with OpenCL High-Level Synthesis (OpenCL-HLS) tools have made tremendous improvements in the reconfigurable computing field. FPGAs inherent pipelined parallelism capability provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to pipelined data-path that hinders the benefits of data-path customization.This paper explores the efficiency of “OpenCL Pipe” to hide memory access latency on cloud FPGAs by decoupling memory access from computation. The Pipe semantic is leveraged to split OpenCL kernels into “read”, “compute” and “write back” sub-kernels which work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite vs. 3.1. All our tests are conducted on the Xilinx VU9FP FPGA platform of Amazon cloud-based AWS EC2 F1 instance. On average, we observe 5.2x speedup with a 2.2x increase in memory bandwidth utilization with about 2.5x increase in FPGA resource utilization over the baseline synthesis (Xilinx OpenCL-HLS).11This work has been funded and supported by the Xilinx University Program (XUP)..","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131352823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Algorithms in PGAS: Chapel and UPC++
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916309
Louis Jenkins, J. Firoz, Marcin Zalewski, C. Joslyn, Mark Raugas
The Partitioned Global Address Space (PGAS) programming model can be implemented either with programming language features or with runtime library APIs, each implementation favoring different aspects (e.g., productivity, abstraction, flexibility, or performance). Certain language and runtime features, such as collectives, explicit and asynchronous communication primitives, and constructs facilitating overlap of communication and computation (such as futures and conjoined futures), can enable better performance and scaling for irregular applications, in particular for distributed graph analytics. We compare graph algorithms in one of each of these environments: the Chapel PGAS programming language and the UPC++ PGAS runtime library. We implement breadth-first search and triangle counting graph kernels in both environments. We discuss the code in each environment and present performance data on a Cray Aries and an InfiniBand platform. Our results show that the library-based approach of UPC++ currently provides strong performance, while Chapel provides a high-level abstraction that, although harder to optimize, still delivers comparable performance.
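For orientation, the kernel at the heart of the BFS comparison is a level-synchronous frontier expansion. The sequential Python sketch below shows that kernel only; it deliberately omits everything the Chapel and UPC++ versions actually differ on (data distribution, remote puts/gets, atomics, and futures).

```python
# Sequential level-synchronous BFS (illustration of the kernel being compared).
from collections import defaultdict

def bfs_levels(adj, source):
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:          # a remote atomic "visit" in PGAS versions
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

adj = defaultdict(list)
for u, v in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    adj[u].append(v); adj[v].append(u)
print(bfs_levels(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```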
{"title":"Graph Algorithms in PGAS: Chapel and UPC++","authors":"Louis Jenkins, J. Firoz, Marcin Zalewski, C. Joslyn, Mark Raugas","doi":"10.1109/HPEC.2019.8916309","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916309","url":null,"abstract":"The Partitioned Global Address Space (PGAS) programming model can be implemented either with programming language features or with runtime library APIs, each implementation favoring different aspects (e.g., productivity, abstraction, flexibility, or performance). Certain language and runtime features, such as collectives, explicit and asynchronous communication primitives, and constructs facilitating overlap of communication and computation (such as futures and conjoined futures) can enable better performance and scaling for irregular applications, in particular for distributed graph analytics. We compare graph algorithms in one of each of these environments: the Chapel PGAS programming language and the the UPC++ PGAS runtime library. We implement algorithms for breadth-first search and triangle counting graph kernels in both environments. We discuss the code in each of the environments, and compile performance data on a Cray Aries and an Infiniband platform. Our results show that the library-based approach of UPC++ currently provides strong performance while Chapel provides a high-level abstraction that, harder to optimize, still provides comparable performance.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"297-301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130817903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Triangle Counting on Distributed-Memory Systems
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916302
Seher Acer, Abdurrahman Yasar, S. Rajamanickam, Michael M. Wolf, Ümit V. Çatalyürek
Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the "Static Graph Challenge". In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism under the Cilk programming model. Besides exhibiting very good strong-scaling behavior on almost all tested graphs, our algorithm achieves the fastest time on the 1.4-billion-edge real-world Twitter graph: 3.217 seconds on 1,092 cores. Compared to past distributed-memory parallel winners of the graph challenge, we demonstrate a 2.7× speedup on this Twitter graph. This is also the fastest time reported for parallel triangle counting on the Twitter graph when the graph is not replicated.
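In its simplest serial form, the linear-algebra formulation mentioned above is a masked matrix product: with L the strictly lower-triangular part of the adjacency matrix, the triangle count is the sum of (L·L) masked by L. The NumPy sketch below shows the formula on a toy graph; the paper's contribution lies in partitioning and parallelizing this computation, not in the formula itself.

```python
# Triangle counting via the masked matrix product sum((L @ L) * L).
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])          # undirected graph: edges 0-1, 0-2, 1-2, 1-3, 2-3
L = np.tril(A, -1)                    # strictly lower-triangular part
triangles = int(((L @ L) * L).sum())
print(triangles)                      # 2 triangles: (0, 1, 2) and (1, 2, 3)
```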
{"title":"Scalable Triangle Counting on Distributed-Memory Systems","authors":"Seher Acer, Abdurrahman Yasar, S. Rajamanickam, Michael M. Wolf, Ümit V. Çatalyürek","doi":"10.1109/HPEC.2019.8916302","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916302","url":null,"abstract":"Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the “Static Graph Challenge”. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate a speed up of 2.7× on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph when the graph is not replicated.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133160698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lossless Compression of Internal Files in Parallel Reservoir Simulation
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916298
M. Rogowski, Suha N. Kayum, F. Mannuß
In parallel reservoir simulation, massive files are written recurrently throughout a simulation run. A method is developed to compress the distributed data to be written during the simulation run and to output it to a single compressed file. Several compression algorithms are evaluated on a range of simulation models. The presented method results in a 3× reduction in file size and a decrease in total application runtime.
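As a toy illustration of the general idea of turning per-process data into one compressed file (not the paper's actual I/O path, file format, or compressor choice, none of which are described here), each simulated rank compresses its own buffer and the chunks are concatenated behind a small index:

```python
# Toy single-file layout: [chunk count][len, data][len, data]... (illustrative only).
import zlib, struct

rank_buffers = [bytes([r]) * 100_000 for r in range(4)]    # stand-in for per-rank data
chunks = [zlib.compress(buf, level=6) for buf in rank_buffers]

with open("snapshot.z", "wb") as f:
    f.write(struct.pack("<I", len(chunks)))                # header: number of chunks
    for c in chunks:
        f.write(struct.pack("<I", len(c)))                 # per-chunk compressed length
        f.write(c)

with open("snapshot.z", "rb") as f:                        # read back and verify
    (n,) = struct.unpack("<I", f.read(4))
    restored = [zlib.decompress(f.read(struct.unpack("<I", f.read(4))[0]))
                for _ in range(n)]
assert restored == rank_buffers
```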
{"title":"Lossless Compression of Internal Files in Parallel Reservoir Simulation","authors":"M. Rogowski, Suha N. Kayum, F. Mannuß","doi":"10.1109/HPEC.2019.8916298","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916298","url":null,"abstract":"In parallel reservoir simulation, massively sized files are written recurrently throughout a simulation run. A method is developed to compress the distributed data to be written during the simulation run and to output it to a single compressed file. Evaluation of several compression algorithms on a range of simulation models is performed. The presented method results in 3x file size reduction and a decrease in the total application runtime.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132998017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Triangle Counting on GPU
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916216
Chuangyi Gui, Long Zheng, Pengcheng Yao, Xiaofei Liao, Hai Jin
Triangle counting is one of the most basic graph applications, used to solve many real-world problems in a wide variety of domains. Exploiting the massive parallelism of the Graphics Processing Unit (GPU) to accelerate triangle counting is prevalent. We identify that state-of-the-art GPU-based studies that focus on improving load balancing still exhibit a large number of random accesses that degrade performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the processed vertex in fast shared memory in advance to avoid the high latency of random global memory accesses. We also adopt a degree-based graph reordering technique and design a simple heuristic to distribute the workload evenly. Compared to the state-of-the-art HPEC Graph Challenge champion from last year, we improve the performance of triangle counting by up to a 5.9× speedup, with more than 10^9 TEPS on a single GPU, for many large real graphs from the Graph Challenge datasets.
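A set-intersection view of the same kernel clarifies what the prefetching buys: for each edge (u, v), the counting kernel scans v's neighbor list against u's, so staging u's list in fast shared memory pays off across all of u's edges. The Python sketch below models that buffer with a cached set; it is only an illustration of the access pattern, not the GPU implementation.

```python
# Per-edge neighbor-list intersection with a "buffered" neighbor set (illustrative).
def count_triangles(adj):
    total = 0
    for u, nbrs_u in adj.items():
        cached = set(nbrs_u)          # stands in for u's list staged in shared memory
        for v in nbrs_u:
            if v > u:                 # process each undirected edge once
                total += sum(1 for w in adj[v] if w > v and w in cached)
    return total

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
print(count_triangles(adj))           # 2
```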
{"title":"Fast Triangle Counting on GPU","authors":"Chuangyi Gui, Long Zheng, Pengcheng Yao, Xiaofei Liao, Hai Jin","doi":"10.1109/HPEC.2019.8916216","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916216","url":null,"abstract":"Triangle counting is one of the most basic graph applications to solve many real-world problems in a wide variety of domains. Exploring the massive parallelism of the Graphics Processing Unit (GPU) to accelerate the triangle counting is prevail. We identify that the stat-of-the-art GPU-based studies that focus on improving the load balancing still exhibit inherently a large number of random accesses in degrading the performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the processed vertex in advance in the fast shared memory to avoid high latency of random global memory access. Also, we adopt the degree-based graph reordering technique and design a simple heuristic to evenly distribute the workload. Compared to the state-of-the-art HEPC Graph Challenge Champion in the last year, we advance to improve the performance of triangle counting by up to $5.9 times $ speedup with $gt 10^{9}$ TEPS on a single GPU for many large real graphs from graph challenge datasets.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133540336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proactive Cyber Situation Awareness via High Performance Computing
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916528
A. Wollaber, Jaime Peña, Benjamin Blease, Leslie Shing, Kenneth Alperin, Serge Vilvovsky, P. Trepagnier, Neal Wagner, Leslie Leonard
Cyber situation awareness technologies have largely been focused on present-state conditions, with limited ability to forward-project nominal conditions in a contested environment. We demonstrate an approach that uses data-driven, high performance computing (HPC) simulations of attacker/defender activities in a logically connected network environment to enable this capability for interactive, operational decision making in real time. Our contributions are three-fold: (1) we link live cyber data to inform the parameters of a cybersecurity model, (2) we perform HPC simulations and optimizations with a genetic algorithm to evaluate and recommend risk remediation strategies that inhibit attacker lateral movement, and (3) we provide a prototype platform that allows cyber defenders to assess the value of their own alternative risk reduction strategies on a relevant timeline. We present an overview of the data and software architectures, along with results that demonstrate operational utility alongside HPC-enabled runtimes.
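To make contribution (2) concrete, the sketch below is a toy genetic algorithm that selects a set of mitigations under a budget. Everything in it (the risk model, the fitness function, the parameters) is invented for illustration; the paper's simulation-driven fitness evaluation is not described in the abstract.

```python
# Toy GA over a bitmask of mitigations (purely illustrative assumptions throughout).
import random
random.seed(0)

N_MITIGATIONS, BUDGET, POP, GENS = 12, 4, 30, 40
risk_reduction = [random.random() for _ in range(N_MITIGATIONS)]   # toy risk model

def fitness(genome):
    if sum(genome) > BUDGET:                       # over budget: penalize
        return -sum(genome)
    return sum(r for g, r in zip(genome, risk_reduction) if g)

def mutate(genome, p=0.1):
    return [1 - g if random.random() < p else g for g in genome]

def crossover(a, b):
    cut = random.randrange(1, N_MITIGATIONS)
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in range(N_MITIGATIONS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: POP // 2]
    pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                     for _ in range(POP - len(parents))]

best = max(pop, key=fitness)
print(best, round(fitness(best), 3))
```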
{"title":"Proactive Cyber Situation Awareness via High Performance Computing","authors":"A. Wollaber, Jaime Peña, Benjamin Blease, Leslie Shing, Kenneth Alperin, Serge Vilvovsky, P. Trepagnier, Neal Wagner, Leslie Leonard","doi":"10.1109/HPEC.2019.8916528","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916528","url":null,"abstract":"Cyber situation awareness technologies have largely been focused on present-state conditions, with limited abilities to forward-project nominal conditions in a contested environment. We demonstrate an approach that uses data-driven, high performance computing (HPC) simulations of attacker/defender activities in a logically connected network environment that enables this capability for interactive, operational decision making in real time. Our contributions are three-fold: (1) we link live cyber data to inform the parameters of a cybersecurity model, (2) we perform HPC simulations and optimizations with a genetic algorithm to evaluate and recommend risk remediation strategies that inhibit attacker lateral movement, and (3) we provide a prototype platform to allow cyber defenders to assess the value of their own alternative risk reduction strategies on a relevant timeline. We present an overview of the data and software architectures, and results are presented that demonstrate operational utility alongside HPC-enabled runtimes.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133993922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Auxiliary Maximum Likelihood Estimation for Noisy Point Cloud Registration
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916224
Cole Campton, Xiaobai Sun
We first establish a theoretical foundation for the use of the Gromov-Hausdorff (GH) distance for point-set registration with homeomorphic deformation maps perturbed by Gaussian noise. We then present a probabilistic, deformable registration framework. At the core of the framework is a highly efficient iterative algorithm with guaranteed convergence to a local minimum of the GH-based objective function. The framework has two other key components: a multi-scale stochastic shape descriptor and a data compression scheme. We also present an experimental comparison between our method and two existing influential methods on non-rigid motion between digital anthropomorphic phantoms extracted from physical data of multiple individuals.
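For reference, the standard correspondence-based definition of the Gromov-Hausdorff distance between compact metric spaces (X, d_X) and (Y, d_Y) is the textbook one below; it is quoted for orientation only and is not a formula taken from the paper.

```latex
d_{GH}(X, Y) \;=\; \tfrac{1}{2}\,\inf_{R}\;\sup_{(x,y),\,(x',y') \in R} \bigl|\, d_X(x, x') - d_Y(y, y') \,\bigr|
```

where R ranges over correspondences, i.e., relations R ⊆ X × Y whose projections onto X and onto Y are both surjective.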
{"title":"Auxiliary Maximum Likelihood Estimation for Noisy Point Cloud Registration","authors":"Cole Campton, Xiaobai Sun","doi":"10.1109/HPEC.2019.8916224","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916224","url":null,"abstract":"We establish first a theoretical foundation for the use of Gromov-Hausdorff (GH) distance for point set registration with homeomorphic deformation maps perturbed by Gaussian noise. We then present a probabilistic, deformable registration framework. At the core of the framework is a highly efficient iterative algorithm with guaranteed convergence to a local minimum of the GH-based objective function. The framework has two other key components – a multi-scale stochastic shape descriptor and a data compression scheme. We also present an experimental comparison between our method and two existing influential methods on non-rigid motion between digital anthropomorphic phantoms extracted from physical data of multiple individuals.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134328438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concurrent Katz Centrality for Streaming Graphs
Pub Date: 2019-09-01 | DOI: 10.1109/HPEC.2019.8916572
Chunxing Yin, E. J. Riedy
Most current frameworks for streaming graph analysis "stop the world" and halt data ingest while updating analysis results. Others add overhead for different forms of version control. In both approaches, adding analysis kernels adds further overhead to the entire system. A new formal model of concurrent analysis lets some algorithms, those valid for the model, update results concurrently with data ingest and without synchronization. Additional kernels incur very little overhead. Here we present the first experimental results for the new model, considering the performance and result latency of updating Katz centrality on a low-power edge platform. The Katz centrality values remain close to those of the synchronous algorithm, while latency is reduced by 12.8× to 179×.
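For context, the synchronous baseline the concurrent results are compared against is the familiar Jacobi-style Katz iteration, c ← α·A·c + 1, sketched below in Python/NumPy. This is only the textbook iteration; the paper's concurrent, unsynchronized streaming variant is the contribution and is not reproduced here.

```python
# Textbook Katz centrality iteration c = alpha * A @ c + 1 (converges for
# alpha smaller than 1 / lambda_max(A)); illustration only.
import numpy as np

def katz(A, alpha=0.1, iters=100):
    c = np.ones(A.shape[0])
    for _ in range(iters):
        c = alpha * (A @ c) + 1.0
    return c

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # a 3-vertex path graph
print(katz(A))                            # the middle vertex gets the highest score
```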
{"title":"Concurrent Katz Centrality for Streaming Graphs","authors":"Chunxing Yin, E. J. Riedy","doi":"10.1109/HPEC.2019.8916572","DOIUrl":"https://doi.org/10.1109/HPEC.2019.8916572","url":null,"abstract":"Most current frameworks for streaming graph analysis “stop the world” and halt ingesting data while updating analysis results. Others add overhead for different forms of version control. In both methods, adding additional analysis kernels adds additional overhead to the entire system. A new formal model of concurrent analysis lets some algorithms, those valid for the model, update results concurrently with data ingest without synchronization. Additional kernels incur very little overhead. Here we present the first experimental results for the new model, considering the performance and result latency of updating Katz centrality on a low-power edge platform. The Katz centrality values remain close to the synchronous algorithm while reducing latency delay from 12.8$times $ to 179$times $.","PeriodicalId":184253,"journal":{"name":"2019 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114210157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}