
Latest publications: 2017 IEEE High Performance Extreme Computing Conference (HPEC)

Real-time regex matching with Apache Spark
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091063
Shaun R. Deaton, D. Brownfield, Leonard Kosta, Zhaozhong Zhu, Suzanne J. Matthews
Network Monitoring Systems (NMS) are an important part of protecting Army and enterprise networks. As governments and corporations grow, the amount of traffic data collected by NMS grows proportionally. To protect users against emerging threats, it is common practice for organizations to maintain a series of custom regular expression (regex) patterns to run on NMS data. However, the growth of network traffic makes it increasingly difficult for network administrators to perform this process quickly. In this paper, we describe a novel algorithm that leverages Apache Spark to perform regex matching in parallel. We test our approach on a dataset of 31 million Bro HTTP log events and 569 regular expressions provided by the Army Engineer Research & Development Center (ERDC). Our results indicate that we are able to process 1,250 events in 1.047 seconds, meeting the desired definition of real-time.
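The core idea, applying every pattern in a fixed rule set to each log event independently and distributing events across workers, can be sketched in plain Python. Here a local thread pool stands in for Spark's parallel `map`, and the patterns and events are hypothetical stand-ins for the ERDC rule set and Bro logs:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rule set; the real deployment uses 569 ERDC-provided patterns.
PATTERNS = [re.compile(p) for p in
            [r"evil\.example\.com", r"cmd\.exe", r"union\s+select"]]

def match_event(event):
    """Return the indices of all patterns that match one log line."""
    return [i for i, pat in enumerate(PATTERNS) if pat.search(event)]

def match_all(events, workers=4):
    # In Spark this would be events_rdd.map(match_event) over cluster
    # executors; a local thread pool stands in for that parallel map here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(match_event, events))
```

Because each event is matched independently, the workload is embarrassingly parallel, which is what makes the Spark mapping natural.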
Citations: 1
Superstrider associative array architecture: Approved for unlimited unclassified release: SAND2017-7089 C
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091044
E. Debenedictis, Jeanine E. Cook, S. Srikanth, T. Conte
We define the Superstrider architecture and report simulation results that show it could be key to achieving HIVE hardware goals. Superstrider's performance comes from a novel sparse-to-dense stream converter, which relies on 3D manufacturing to tightly couple DRAM to an internal network so operations like merging and parallel prefix can be performed quickly and efficiently. With the ability to use the stream converter as a programming primitive, the memory-bound low-level graph operations that we are aware of speed up substantially. We give special attention to triangle counting in this paper. Simulations detailed elsewhere show 50–1,000× improvement in speed and energy efficiency. The low end of the range should be achievable by constructing a custom controller for current High Bandwidth Memory (HBM), while the high end would require fully integrated 3D that is on roadmaps for the future.
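As a software analogue of the parallel-prefix primitive the abstract highlights, the following sketch implements a Blelloch-style exclusive scan; each inner loop level corresponds to one set of operations a parallel machine would execute simultaneously. This is an illustrative stand-in, not Superstrider's hardware implementation:

```python
def exclusive_scan(a):
    """Blelloch-style exclusive prefix sum over a power-of-two-length list.
    Each pass over a level is independent work a parallel machine could
    execute in one step."""
    n = len(a)  # assumed to be a power of two
    x = list(a)
    d = 1
    while d < n:                      # up-sweep (reduction) phase
        for i in range(d * 2 - 1, n, d * 2):
            x[i] += x[i - d]
        d *= 2
    x[n - 1] = 0                      # clear the root
    d = n // 2
    while d >= 1:                     # down-sweep phase
        for i in range(d * 2 - 1, n, d * 2):
            t = x[i - d]
            x[i - d] = x[i]
            x[i] += t
        d //= 2
    return x
```

The work is O(n) while the parallel depth is O(log n), which is why prefix sums are a useful building block for merge-style hardware primitives.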
Citations: 6
Broadening the exploration of the accelerator design space in embedded scalable platforms
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091091
Luca Piccolboni, Paolo Mantovani, G. D. Guglielmo, L. Carloni
Accelerators are specialized hardware designs that generally guarantee two to three orders of magnitude higher energy efficiency than general-purpose processor cores for their target computational kernels. To cope with the complexity of integrating many accelerators into heterogeneous systems, we have proposed Embedded Scalable Platforms (ESP), which combine a flexible architecture with a companion system-level design (SLD) methodology. In ESP, we leverage high-level synthesis (HLS) to expedite the design of accelerators, improve the process of design-space exploration (DSE), and promote the reuse of accelerators across different target systems-on-chip (SoCs). HLS tools offer a powerful set of parameters, known as knobs, to optimize the architecture of an accelerator and evaluate different trade-offs in terms of performance and costs. However, exploring a large region of the design space and identifying a rich set of Pareto-optimal implementations are still complex tasks. The standard knobs, in fact, operate only on loops and functions present in the high-level specifications, but they cannot work on other key aspects of SLD such as I/O bandwidth, on-chip memory organization, and trade-offs between the size of the local memory and the granularity at which data is transferred and processed by the accelerators. To address these limitations, we augmented the set of HLS knobs for ESP with three additional knobs, named eXtended Knobs (XKnobs). We used the XKnobs for exploring two selected kernels of the wide-area motion imagery (WAMI) application. Experimental results show that the DSE is broadened by up to 8.5× for the performance figure (latency) and 3.5× for the implementation costs (area) compared to using only the standard knobs.
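The notion of a Pareto-optimal set of implementations can be made concrete with a small sketch: given hypothetical (latency, area) pairs from HLS runs, keep only the points that no other point dominates in both objectives (both minimized):

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of (latency, area) pairs, where
    both objectives are minimized.  A point is dropped if some other
    point is no worse in both objectives (and not the identical pair)."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return front
```

Richer knob sets widen the spread of candidate points, which is exactly how a "broadened" DSE yields a larger Pareto front.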
Citations: 16
Leakage energy reduction for hard real-time caches
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091060
Y. Huangfu, Wei Zhang
Cache leakage reduction techniques usually compromise time predictability, which is not desirable for real-time systems. In this work, we extend the cache decay and drowsy cache techniques within the hardware-based Performance Enhancement Guaranteed Cache (PEG-C) architecture. The PEG-C can dynamically monitor the performance penalties caused by using leakage energy reduction techniques to ensure that the worst-case execution time (WCET) is better than the case without using any cache. At the same time, the leakage energy of caches can be reduced significantly.
Citations: 0
Advanced load balancing for SPH simulations on multi-GPU architectures
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091093
Kevin Verma, K. Szewc, R. Wille
Smoothed Particle Hydrodynamics (SPH) is a numerical method for fluid flow modeling, in which the fluid is discretized by a set of particles. SPH makes it possible to model complex scenarios that are difficult or costly to measure in the real world. This method has several advantages compared to other approaches, but suffers from high numerical complexity. In order to simulate real-life phenomena, up to several hundred million particles have to be considered. Hence, HPC methods need to be leveraged to make SPH applicable for industrial applications. Distributing the respective computations among different GPUs to exploit massive parallelism is therefore particularly well suited. However, certain characteristics of SPH make it a non-trivial task to properly distribute the respective workload. In this work, we present a load balancing method for a CUDA-based industrial SPH implementation on multi-GPU architectures. To that end, dedicated memory handling schemes are introduced, which reduce the synchronization overhead. Experimental evaluations confirm the scalability and efficiency of the proposed methods.
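One simple particle load-balancing scheme (a hedged sketch of the general idea, not necessarily the paper's method) is to sort particles along one spatial axis and cut the domain into contiguous slabs holding near-equal particle counts, one slab per GPU:

```python
def partition_particles(xs, n_devices):
    """Assign particles to devices by sorting along one coordinate and
    cutting into contiguous slabs with near-equal particle counts.
    xs is a list of particle positions along the chosen axis."""
    xs = sorted(xs)
    n = len(xs)
    # Integer cut points keep slab sizes within one particle of each other.
    return [xs[(n * k) // n_devices:(n * (k + 1)) // n_devices]
            for k in range(n_devices)]
```

Equal counts rather than equal spatial extents matter because SPH cost scales with the number of particles, not the slab volume; a real scheme must also handle halo exchange at the slab boundaries.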
Citations: 13
Accelerating big data applications using lightweight virtualization framework on enterprise cloud
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091086
J. Bhimani, Zhengyu Yang, M. Leeser, N. Mi
Hypervisor-based virtualization technology has been successfully used to deploy high-performance and scalable infrastructure for Hadoop, and now Spark applications. Container-based virtualization techniques are becoming an important option, increasingly used due to their lightweight operation and better scaling when compared to Virtual Machines (VM). With containerization techniques such as Docker becoming mature and promising better performance, we can use Docker to speed up big data applications. However, as applications have different behaviors and resource requirements, before replacing traditional hypervisor-based virtual machines with Docker, it is important to analyze and compare performance of applications running in the cloud with VMs and Docker containers. VM provides distributed resource management for different virtual machines running with their own allocated resources, while Docker relies on a shared pool of resources among all containers. Here, we investigate the performance of different Apache Spark applications using both Virtual Machines (VM) and Docker containers. While others have looked at Docker's performance, this is the first study that compares these different virtualization frameworks for a big data enterprise cloud environment using Apache Spark. In addition to makespan and execution time, we also analyze different resource utilization (CPU, disk, memory, etc.) by Spark applications. Our results show that Spark using Docker can obtain a speed-up of over 10 times when compared to using VM.
Citations: 57
Truss decomposition on shared-memory parallel systems
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091049
Shaden Smith, Xing Liu, Nesreen Ahmed, A. Tom, F. Petrini, G. Karypis
The scale of data used in graph analytics grows at an unprecedented rate. More than ever, domain experts require efficient and parallel algorithms for tasks in graph analytics. One such task is the truss decomposition, which is a hierarchical decomposition of the edges of a graph and is closely related to the task of triangle enumeration. As evidenced by the recent GraphChallenge, existing algorithms and implementations for truss decomposition are insufficient for the scale of modern datasets. In this work, we propose a parallel algorithm for computing the truss decomposition of massive graphs on a shared-memory system. Our algorithm breaks a computation-efficient serial algorithm into several bulk-synchronous parallel steps which do not rely on atomics or other fine-grained synchronization. We evaluate our algorithm across a variety of synthetic and real-world datasets on a 56-core Intel Xeon system. Our serial implementation achieves over 1400× speedup over the provided GraphChallenge serial benchmark implementation and is up to 28× faster than the state-of-the-art shared-memory parallel algorithm.
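The peeling idea behind truss decomposition can be shown in a compact serial, unoptimized sketch: an edge's support is the number of triangles containing it, and at each level k, edges with support below k-2 are repeatedly removed; the level at which an edge falls out gives its truss number. This illustrates the definition only, not the paper's bulk-synchronous algorithm:

```python
from collections import defaultdict

def truss_numbers(edges):
    """Return {edge: truss number} by iterative support-based peeling.
    An edge's support is |N(u) & N(v)|, the triangles through it."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    alive = {tuple(sorted(e)) for e in edges}
    truss = {}
    k = 3
    while alive:
        changed = True
        while changed:                      # peel to a fixed point at level k
            changed = False
            for u, v in list(alive):
                if len(adj[u] & adj[v]) < k - 2:
                    alive.discard((u, v))
                    adj[u].discard(v); adj[v].discard(u)
                    truss[(u, v)] = k - 1   # edge survives up to the (k-1)-truss
                    changed = True
        k += 1
    return truss
```

The repeated recomputation of supports after each removal is what makes naive peeling expensive, and why scalable implementations batch the removals into bulk-synchronous rounds.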
Citations: 45
Collaborative (CPU + GPU) algorithms for triangle counting and truss decomposition on the Minsky architecture: Static graph challenge: Subgraph isomorphism
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091042
K. Date, Keven Feng, R. Nagi, Jinjun Xiong, N. Kim, Wen-mei W. Hwu
In this paper, we present collaborative CPU + GPU algorithms for triangle counting and truss decomposition, the two fundamental problems in graph analytics. We describe the implementation details and present experimental evaluation on the IBM Minsky platform. The main contribution of this paper is a thorough benchmarking and comparison of the different memory management schemes offered by CUDA 8 and NVLink, which can be harnessed for tackling large problems where the limited GPU memory capacity is the primary bottleneck in traditional computing platforms. We find that the collaborative algorithms achieve 28× speedup on average (180× max) for triangle counting, and 165× speedup on average (498× max) for truss decomposition, when compared with the baseline Python implementation provided by the Graph Challenge organizers.
Citations: 21
Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091054
A. Tom, N. Sundaram, Nesreen Ahmed, Shaden Smith, Stijn Eyerman, Midhunchandra Kodiyath, I. Hur, F. Petrini, G. Karypis
The widespread use of graphs to model large scale real-world data brings with it the need for fast graph analytics. In this paper, we explore the problem of triangle counting, a fundamental graph-analytic operation, on shared-memory platforms. Existing triangle counting implementations do not effectively utilize the key characteristics of large sparse graphs for tuning their algorithms for performance. We explore such optimizations and develop faster serial and parallel variants of existing algorithms, which outperform the state-of-the-art on Intel manycore and multicore processors. Our algorithms achieve good strong scaling on many graphs with varying scale and degree distributions. Furthermore, we extend our optimizations to a well-known graph processing framework, GraphMat, and demonstrate their generality.
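One classic optimization of the kind explored in this line of work is degree ordering: orient each edge from its lower-ranked to its higher-ranked endpoint, so every triangle is counted exactly once via out-neighbor set intersections. This is a minimal serial sketch of that trick, not the paper's tuned implementation:

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles by degree-ordered edge orientation: direct each
    edge toward its higher-(degree, id) endpoint, then intersect
    out-neighbor sets so each triangle is seen exactly once."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1; deg[v] += 1
    rank = lambda w: (deg[w], w)          # break degree ties by vertex id
    out = defaultdict(set)
    for u, v in edges:
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out[a].add(b)
    # Each directed wedge u -> v closes a triangle for every common
    # out-neighbor of u and v.
    return sum(len(out[u] & out[v]) for u in list(out) for v in out[u])
```

Orienting by degree bounds every out-neighborhood by roughly the square root of the edge count on sparse graphs, which is what makes the intersection step cheap on skewed degree distributions.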
Citations: 24
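The triangle counting problem described in the abstract above is commonly solved by intersecting sorted adjacency lists. As a minimal illustration (a generic textbook formulation, not the paper's tuned serial or parallel variants; the example graph is hypothetical), the following counts each triangle exactly once by only examining ordered vertex triples u < v < w:

```python
# Set-intersection triangle counting on sorted adjacency lists.
# Generic sketch for illustration only; real implementations use CSR
# storage and degree-based vertex reordering for performance.

def count_triangles(adj):
    """Count triangles in an undirected graph.

    adj maps each vertex to a sorted list of its neighbors. Each edge
    (u, v) is processed only when u < v, and only common neighbors
    w > v are counted, so every triangle is seen exactly once.
    """
    total = 0
    for u, neighbors in adj.items():
        for v in neighbors:
            if v <= u:
                continue
            # Merge-style intersection of the two sorted neighbor lists.
            i = j = 0
            nu, nv = adj[u], adj[v]
            while i < len(nu) and j < len(nv):
                if nu[i] < nv[j]:
                    i += 1
                elif nu[i] > nv[j]:
                    j += 1
                else:
                    if nu[i] > v:   # common neighbor closing a triangle
                        total += 1
                    i += 1
                    j += 1
    return total

# 4-clique: every one of its C(4,3) = 4 vertex triples is a triangle.
k4 = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(count_triangles(k4))  # → 4
```

The outer two loops are independent per edge, which is what makes shared-memory parallelization natural; the sparse-graph tuning the paper explores concerns how these intersections are scheduled and stored.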
Study on distributed and parallel non-linear optimization algorithm for ocean color remote sensing data
Pub Date : 2017-09-01 DOI: 10.1109/HPEC.2017.8091075
Jung-Ho Um, Sunggeun Han, Hyunwoo Kim, Kyongseok Park
Recent developments in science and technology have made it possible to analyze data observed by satellites using optical properties. By monitoring changes in the ocean environment and ecosystem, we are currently conducting ocean environmental studies to identify abnormal weather phenomena. International aerospace laboratories such as NASA and ESA publish these observations to ocean scientists around the world. Satellite sensing data accumulate day by day, but the data volume at the global scale is so large that scientists usually restrict the spatial domain to the area of interest and perform time-series analyses there. Time-series analysis is mainly applied to nonlinear distributions. However, studies of the ocean environment require analysis of the global ocean and its ecosystems. Data analysis in the global domain requires nonlinear data fitting for every cell of the satellite imagery. However, commercial and open-source analysis tools such as Matlab or R do not provide nonlinear data fitting across multiple cells. This makes it difficult for ocean scientists to implement such analyses directly, and distributed, parallelized computation performance is hard to guarantee. Therefore, in this paper, we propose an algorithm that distributes and parallelizes, in a multi-dimensional database environment, the Levenberg-Marquardt (LM) algorithm, a well-known nonlinear data fitting method. Our algorithm achieved about a 7.5-fold average speed-up over the MINPACK LM implementation, which is based on MPI and written in FORTRAN. In addition, it achieved a 74.3-fold speed-up when comparing the maximum performance of each algorithm. As future research, we will apply the developed algorithms in the ocean science field to the analysis of global-scale satellite imagery data.
Citations: 0
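The abstract above centers on the Levenberg-Marquardt iteration, which damps the Gauss-Newton step by solving (JᵀJ + λ diag(JᵀJ)) δ = Jᵀr. A minimal single-cell sketch in NumPy (a generic LM loop fitting a hypothetical exponential model, not the authors' distributed multi-dimensional-database implementation):

```python
import numpy as np

def lm_fit(f, jac, p0, x, y, lam=1e-3, iters=50):
    """Minimal Levenberg-Marquardt least-squares fit.

    Each iteration solves (JᵀJ + λ diag(JᵀJ)) δ = Jᵀ r and adjusts the
    damping λ: halve it after an accepted step, double it otherwise.
    """
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        r = y - f(x, p)
        J = jac(x, p)
        A = J.T @ J
        delta = np.linalg.solve(A + lam * np.diag(np.diag(A)), J.T @ r)
        p_new = p + delta
        if np.sum((y - f(x, p_new)) ** 2) < np.sum(r ** 2):
            p, lam = p_new, lam * 0.5   # accept step, trust the model more
        else:
            lam *= 2.0                  # reject step, damp harder
    return p

# Hypothetical decay model y = a * exp(b * x), fitted per imagery cell.
model = lambda x, p: p[0] * np.exp(p[1] * x)
jacobian = lambda x, p: np.column_stack(
    [np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)])

x = np.linspace(0, 4, 40)
y = model(x, [2.0, -0.7])
print(lm_fit(model, jacobian, [1.0, -0.1], x, y))  # ≈ [2.0, -0.7]
```

Each imagery cell's time series is an independent fit like this one, which is what makes the global-scale problem amenable to the distributed, parallel execution the paper proposes.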
Journal
2017 IEEE High Performance Extreme Computing Conference (HPEC)
Book学术
Literature sharing · Smart journal selection · Latest publications · Sharing guidelines · Contact us: info@booksci.cn
Book学术 provides a free academic resource search service, helping scholars at home and abroad retrieve Chinese- and English-language literature, and is committed to delivering the most convenient, high-quality user experience.
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1