"Real-time regex matching with Apache Spark"
Shaun R. Deaton, D. Brownfield, Leonard Kosta, Zhaozhong Zhu, Suzanne J. Matthews
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091063
Network Monitoring Systems (NMS) are an important part of protecting Army and enterprise networks. As governments and corporations grow, the amount of traffic data collected by NMS grows proportionally. To protect users against emerging threats, it is common practice for organizations to maintain a series of custom regular expression (regex) patterns to run on NMS data. However, the growth of network traffic makes it increasingly difficult for network administrators to perform this process quickly. In this paper, we describe a novel algorithm that leverages Apache Spark to perform regex matching in parallel. We test our approach on a dataset of 31 million Bro HTTP log events and 569 regular expressions provided by the Army Engineer Research & Development Center (ERDC). Our results indicate that we are able to process 1,250 events in 1.047 seconds, meeting the desired definition of real-time.
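The operation at the heart of this paper — applying a fixed set of compiled patterns to each log event independently — is embarrassingly parallel, which is what makes Spark a natural fit. A minimal local sketch of that structure is below; the patterns are invented for illustration (ERDC's rule set is not public), and a thread pool stands in for the map Spark would run over partitions of an event RDD across a cluster.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for an organization's custom regex rule set.
PATTERNS = [re.compile(p) for p in (
    r"cmd\.exe",            # suspicious executable in a URI
    r"/etc/passwd",         # path-traversal target
    r"(?i)union\s+select",  # SQL-injection fragment
)]

def match_event(event: str):
    """Return the indices of every pattern that fires on one log event."""
    return [i for i, pat in enumerate(PATTERNS) if pat.search(event)]

def scan(events):
    # Spark would distribute this same per-event map over the cluster;
    # a local thread pool illustrates the data-parallel shape.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(match_event, events))
```

Because each event is matched independently, throughput scales with the number of workers until the pattern set itself becomes the bottleneck.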
"Superstrider associative array architecture: Approved for unlimited unclassified release: SAND2017-7089 C"
E. Debenedictis, Jeanine E. Cook, S. Srikanth, T. Conte
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091044
We define the Superstrider architecture and report simulation results showing that it could be key to achieving HIVE hardware goals. Superstrider's performance comes from a novel sparse-to-dense stream converter, which relies on 3D manufacturing to tightly couple DRAM to an internal network so that operations like merging and parallel prefix can be performed quickly and efficiently. With the stream converter available as a programming primitive, the memory-bound low-level graph operations that we are aware of speed up substantially. We give special attention to triangle counting in this paper. Simulations detailed elsewhere show a 50–1,000× improvement in speed and energy efficiency. The low end of the range should be achievable by constructing a custom controller for current High Bandwidth Memory (HBM), while the high end would require the fully integrated 3D manufacturing that is on roadmaps for the future.
"Broadening the exploration of the accelerator design space in embedded scalable platforms"
Luca Piccolboni, Paolo Mantovani, G. D. Guglielmo, L. Carloni
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091091
Accelerators are specialized hardware designs that generally guarantee two to three orders of magnitude higher energy efficiency than general-purpose processor cores for their target computational kernels. To cope with the complexity of integrating many accelerators into heterogeneous systems, we have proposed Embedded Scalable Platforms (ESP), which combine a flexible architecture with a companion system-level design (SLD) methodology. In ESP, we leverage high-level synthesis (HLS) to expedite the design of accelerators, improve the process of design-space exploration (DSE), and promote the reuse of accelerators across different target systems-on-chip (SoCs). HLS tools offer a powerful set of parameters, known as knobs, to optimize the architecture of an accelerator and evaluate different trade-offs between performance and cost. However, exploring a large region of the design space and identifying a rich set of Pareto-optimal implementations are still complex tasks. The standard knobs operate only on the loops and functions present in the high-level specifications; they cannot address other key aspects of SLD such as I/O bandwidth, on-chip memory organization, and the trade-off between the size of the local memory and the granularity at which data is transferred and processed by the accelerators. To address these limitations, we augmented the set of HLS knobs for ESP with three additional knobs, named eXtended Knobs (XKnobs). We used the XKnobs to explore two selected kernels of the wide-area motion imagery (WAMI) application. Experimental results show that the DSE is broadened by up to 8.5× for the performance figure (latency) and 3.5× for the implementation cost (area) compared to using only the standard knobs.
"Leakage energy reduction for hard real-time caches"
Y. Huangfu, Wei Zhang
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091060
Cache leakage reduction techniques usually compromise time predictability, which is undesirable for real-time systems. In this work, we extend the cache decay and drowsy cache techniques within the hardware-based Performance Enhancement Guaranteed Cache (PEG-C) architecture. PEG-C dynamically monitors the performance penalties caused by the leakage-reduction techniques to ensure that the worst-case execution time (WCET) remains better than the case without any cache, while significantly reducing cache leakage energy.
"Advanced load balancing for SPH simulations on multi-GPU architectures"
Kevin Verma, K. Szewc, R. Wille
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091093
Smoothed Particle Hydrodynamics (SPH) is a numerical method for fluid flow modeling in which the fluid is discretized by a set of particles. SPH makes it possible to model complex scenarios that are difficult or costly to measure in the real world. The method has several advantages over other approaches but comes with huge numerical complexity: to simulate real-life phenomena, up to several hundred million particles have to be considered. Hence, HPC methods need to be leveraged to make SPH applicable to industrial applications. Distributing the computations among multiple GPUs to exploit massive parallelism is particularly well suited to this task. However, certain characteristics of SPH make it non-trivial to distribute the workload properly. In this work, we present a load balancing method for a CUDA-based industrial SPH implementation on multi-GPU architectures. To that end, dedicated memory handling schemes are introduced that reduce the synchronization overhead. Experimental evaluations confirm the scalability and efficiency of the proposed methods.
"Accelerating big data applications using lightweight virtualization framework on enterprise cloud"
J. Bhimani, Zhengyu Yang, M. Leeser, N. Mi
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091086
Hypervisor-based virtualization technology has been successfully used to deploy high-performance, scalable infrastructure for Hadoop and, more recently, Spark applications. Container-based virtualization is becoming an important alternative, increasingly used for its lightweight operation and better scaling compared to Virtual Machines (VMs). With containerization techniques such as Docker becoming mature and promising better performance, we can use Docker to speed up big data applications. However, because applications have different behaviors and resource requirements, it is important to analyze and compare the performance of applications running in the cloud on VMs and in Docker containers before replacing traditional hypervisor-based virtual machines with Docker. VMs provide distributed resource management, with each virtual machine running on its own allocated resources, while Docker relies on a shared pool of resources among all containers. Here, we investigate the performance of different Apache Spark applications using both VMs and Docker containers. While others have looked at Docker's performance, this is the first study that compares these virtualization frameworks for a big data enterprise cloud environment using Apache Spark. In addition to makespan and execution time, we also analyze the resource utilization (CPU, disk, memory, etc.) of Spark applications. Our results show that Spark on Docker can obtain a speed-up of over 10× compared to VMs. However, we observe that this may not apply to all applications due to differing workload patterns and the different resource management schemes of virtual machines and containers. Our work can guide application developers, system administrators, and researchers in designing and deploying big data applications on their platforms to improve overall performance.
"Truss decomposition on shared-memory parallel systems"
Shaden Smith, Xing Liu, Nesreen Ahmed, A. Tom, F. Petrini, G. Karypis
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091049
The scale of data used in graph analytics grows at an unprecedented rate. More than ever, domain experts require efficient and parallel algorithms for tasks in graph analytics. One such task is the truss decomposition, which is a hierarchical decomposition of the edges of a graph and is closely related to the task of triangle enumeration. As evidenced by the recent GraphChallenge, existing algorithms and implementations for truss decomposition are insufficient for the scale of modern datasets. In this work, we propose a parallel algorithm for computing the truss decomposition of massive graphs on a shared-memory system. Our algorithm breaks a computationally efficient serial algorithm into several bulk-synchronous parallel steps which do not rely on atomics or other fine-grained synchronization. We evaluate our algorithm across a variety of synthetic and real-world datasets on a 56-core Intel Xeon system. Our serial implementation achieves over 1400× speedup over the provided GraphChallenge serial benchmark implementation and is up to 28× faster than the state-of-the-art shared-memory parallel algorithm.
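For readers unfamiliar with the task: the k-truss of a graph is the maximal subgraph in which every edge closes at least k-2 triangles, and truss decomposition assigns each edge the largest k for which it survives. A minimal serial support-peeling sketch of that definition follows — this is the textbook baseline, not the paper's bulk-synchronous parallel formulation.

```python
from collections import defaultdict

def truss_decomposition(edges):
    """Serial support peeling: return {edge: trussness}, where trussness is
    the largest k such that the edge survives in the k-truss."""
    norm = {(min(u, v), max(u, v)) for u, v in edges}
    adj = defaultdict(set)
    for u, v in norm:
        adj[u].add(v)
        adj[v].add(u)
    # Support = number of triangles each edge participates in.
    support = {(u, v): len(adj[u] & adj[v]) for u, v in norm}
    truss, remaining, k = {}, set(norm), 3
    while remaining:
        # Peel every edge whose support is too low to survive in the k-truss.
        queue = [e for e in remaining if support[e] < k - 2]
        while queue:
            u, v = e = queue.pop()
            truss[e] = k - 1
            remaining.discard(e)
            for w in adj[u] & adj[v]:   # each triangle destroyed by removing e
                for f in ((min(u, w), max(u, w)), (min(v, w), max(v, w))):
                    if f in remaining:
                        support[f] -= 1
                        if support[f] < k - 2 and f not in queue:
                            queue.append(f)
            adj[u].discard(v)
            adj[v].discard(u)
        k += 1
    return truss
```

The paper's contribution is restructuring exactly this kind of iterative peeling into bulk-synchronous steps that avoid the fine-grained synchronization the inner decrement loop would otherwise require.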
"Collaborative (CPU + GPU) algorithms for triangle counting and truss decomposition on the Minsky architecture: Static graph challenge: Subgraph isomorphism"
K. Date, Keven Feng, R. Nagi, Jinjun Xiong, N. Kim, Wen-mei W. Hwu
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091042
In this paper, we present collaborative CPU + GPU algorithms for triangle counting and truss decomposition, two fundamental problems in graph analytics. We describe the implementation details and present an experimental evaluation on the IBM Minsky platform. The main contribution of this paper is a thorough benchmarking and comparison of the different memory management schemes offered by CUDA 8 and NVLink, which can be harnessed to tackle large problems where limited GPU memory capacity is the primary bottleneck on traditional computing platforms. We find that the collaborative algorithms achieve 28× speedup on average (180× max) for triangle counting, and 165× speedup on average (498× max) for truss decomposition, when compared with the baseline Python implementation provided by the Graph Challenge organizers.
"Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms"
A. Tom, N. Sundaram, Nesreen Ahmed, Shaden Smith, Stijn Eyerman, Midhunchandra Kodiyath, I. Hur, F. Petrini, G. Karypis
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091054
The widespread use of graphs to model large-scale real-world data brings with it the need for fast graph analytics. In this paper, we explore the problem of triangle counting, a fundamental graph-analytic operation, on shared-memory platforms. Existing triangle counting implementations do not effectively utilize the key characteristics of large sparse graphs when tuning their algorithms for performance. We explore such optimizations and develop faster serial and parallel variants of existing algorithms, which outperform the state-of-the-art on Intel manycore and multicore processors. Our algorithms achieve good strong scaling on many graphs with varying scale and degree distributions. Furthermore, we extend our optimizations to a well-known graph processing framework, GraphMat, and demonstrate their generality.
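The kernel common to most shared-memory triangle counters is set intersection over ordered adjacency lists, which can be sketched in a few lines. The paper's contribution lies in vertex ordering, data layout, and parallel tuning on top of this idea; the illustrative serial baseline below attempts none of that.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count each triangle {u < v < w} exactly once: intersect the neighbor
    sets of its lowest edge (u, v) and keep only common neighbors w > v."""
    norm = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    adj = defaultdict(set)
    for u, v in norm:
        adj[u].add(v)
        adj[v].add(u)
    return sum(sum(1 for w in adj[u] & adj[v] if w > v) for u, v in norm)
```

Orienting each edge so that only the w > v intersections are counted is what makes every triangle contribute exactly once; production implementations apply the same idea with degree-based orderings to balance the cost of the intersections.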
"Study on distributed and parallel non-linear optimization algorithm for ocean color remote sensing data"
Jung-Ho Um, Sunggeun Han, Hyunwoo Kim, Kyongseok Park
2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091075
Recent developments in science and technology have made it possible to analyze satellite observations using their optical properties. By monitoring changes in the ocean environment and ecosystem, ocean environmental studies can identify abnormal weather phenomena. International aerospace laboratories such as NASA and ESA publish these observations to ocean scientists around the world. Satellite sensing data accumulates day by day, but the data volume at global scale is so large that scientists usually subset the data to their area of interest and perform time-series analyses there. Time-series analysis is mainly applied to nonlinear distributions. However, studies of the ocean environment require analysis of the global ocean and ocean ecosystems, and data analysis in the global domain requires nonlinear data fitting for every cell of the satellite imagery. Commercial and open-source analysis tools such as Matlab and R do not provide non-linear data fitting across multiple cells, so ocean scientists find it difficult to implement such analyses directly, and distributed, parallel performance is hard to guarantee. Therefore, in this paper, we propose an algorithm that distributes and parallelizes, in a multi-dimensional database environment, the well-known Levenberg-Marquardt (LM) non-linear data fitting algorithm. Our algorithm achieved about a 7.5× speed-up on average compared to the MINPACK LM implementation, which is based on MPI and written in Fortran, and a 74.3× speed-up when comparing the maximum performance of each algorithm. In future research, we will apply the developed algorithm to data analysis of global-scale satellite imagery in ocean science.
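For reference, the per-cell fitting step being distributed here is the Levenberg-Marquardt iteration. In its standard textbook form (Marquardt's diagonal scaling; MINPACK's actual variant uses a trust-region formulation, and the paper's exact variant is not stated), each iteration solves a damped normal-equation system for the residual vector and Jacobian of the model:

```latex
% One LM iteration for residuals r(\beta) with Jacobian J = \partial r / \partial \beta:
\left( J^\top J + \lambda \operatorname{diag}(J^\top J) \right) \delta = J^\top r,
\qquad \beta \leftarrow \beta + \delta
```

Large damping \(\lambda\) pushes the step toward scaled gradient descent; \(\lambda \to 0\) recovers Gauss-Newton. Since each imagery cell is fitted independently, these solves parallelize naturally across cells, which is the structure the paper exploits.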