
2017 IEEE High Performance Extreme Computing Conference (HPEC): Latest Publications

Optimized task graph mapping on a many-core neuromorphic supercomputer
Pub Date: 2017-09-14 DOI: 10.1109/HPEC.2017.8091066
Indar Sugiarto, Pedro B. Campos, Nizar Dahir, G. Tempesti, S. Furber
This paper presents an approach for improving the overall performance of a general-purpose application running as a task graph on a many-core neuromorphic supercomputer. Our task graph framework is based on graceful degradation and amelioration paradigms that strive to achieve high reliability and performance by incorporating fault tolerance and task spawning features. The optimization is applied to an instance of the task graph by performing soft load balancing on the data traffic between nodes in the graph. We implemented the framework and its optimization on SpiNNaker, a many-core neuromorphic platform containing a million ARM9 processing cores. We evaluate our method using several static mapping examples, some of which were generated using an evolutionary algorithm. The experiment demonstrates that a performance improvement of up to 8.2% can be achieved when implementing our algorithm on a fully-utilized SpiNNaker communication infrastructure.
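To make the soft load balancing idea concrete, here is a minimal Python sketch that greedily migrates one task per round from the core carrying the most edge traffic to the core carrying the least. The data layout, function names (traffic_load, soft_balance), and one-task-per-round policy are illustrative assumptions, not the authors' SpiNNaker implementation.

```python
# Toy model: tasks are mapped to cores, edges carry traffic weights, and
# "soft" balancing moves one lightweight task per round off the busiest core.
from collections import defaultdict

def traffic_load(mapping, edges):
    """Total edge traffic handled by each core under a task -> core mapping."""
    load = defaultdict(float)
    for (u, v), w in edges.items():
        load[mapping[u]] += w
        load[mapping[v]] += w
    return load

def soft_balance(mapping, edges, rounds=10):
    mapping = dict(mapping)
    for _ in range(rounds):
        load = traffic_load(mapping, edges)
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        if busiest == idlest or load[busiest] == load[idlest]:
            break  # nothing left to balance
        # Pick the task on the busiest core with the least attached traffic.
        candidates = [t for t, c in mapping.items() if c == busiest]
        task = min(candidates, key=lambda t: sum(
            w for (u, v), w in edges.items() if t in (u, v)))
        mapping[task] = idlest
    return mapping

edges = {("a", "b"): 3.0, ("b", "c"): 1.0, ("a", "c"): 2.0}
print(soft_balance({"a": 0, "b": 0, "c": 1}, edges))
```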
Citations: 6
Dynamic trace-based sampling algorithm for memory usage tracking of enterprise applications
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091061
Houssem Daoud, Naser Ezzati-Jivan, M. Dagenais
Excessive memory usage in software applications has become a frequent issue. A high degree of parallelism and the difficulty of monitoring for the developer can quickly lead to memory shortages or increase the duration of garbage collection cycles. Several solutions have been introduced to monitor memory usage in software; however, they are neither efficient nor scalable. In this paper, we propose a dynamic tracing-based sampling algorithm to collect and analyse run-time information and metrics for memory usage. It is implemented as a kernel module that gathers memory usage data from operating system structures only when a predefined condition is met or a threshold is crossed. The thresholds and conditions are preset but can be changed dynamically based on application behavior. We tested our solution by monitoring several applications, and our evaluation results show that the proposed method generates compact trace data and reduces the time needed for the analysis without losing precision.
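A minimal sketch of the threshold-triggered idea, in user-space Python rather than a kernel module: a sample is recorded only when memory usage crosses a threshold that is then raised, instead of logging every allocation event. ThresholdSampler and its adjustment policy are assumptions for illustration; note that ru_maxrss is reported in KiB on Linux (bytes on macOS).

```python
# Record a memory sample only when usage crosses a dynamically raised
# threshold, rather than on every allocation event (Unix-only: resource).
import resource
import time

class ThresholdSampler:
    def __init__(self, threshold_kb, step_kb):
        self.threshold_kb = threshold_kb
        self.step_kb = step_kb          # raise the bar after each sample
        self.samples = []

    def poll(self):
        usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        if usage_kb >= self.threshold_kb:
            self.samples.append((time.time(), usage_kb))
            self.threshold_kb += self.step_kb   # dynamic adjustment
            return True
        return False

sampler = ThresholdSampler(threshold_kb=10_000, step_kb=5_000)
blobs = []
for _ in range(100):
    blobs.append(bytearray(1_000_000))  # allocate ~1 MB per iteration
    sampler.poll()
print(f"{len(sampler.samples)} samples recorded instead of 100 events")
```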
Citations: 3
Scalable static and dynamic community detection using Grappolo
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091047
M. Halappanavar, Hao Lu, A. Kalyanaraman, Antonino Tumeo
Graph clustering, popularly known as community detection, is a fundamental kernel for several applications of relevance to the Defense Advanced Research Projects Agency's (DARPA) Hierarchical Identify Verify Exploit (HIVE) Program. Clusters or communities represent natural divisions within a network that are densely connected within a cluster and sparsely connected to the rest of the network. The need to compute clusterings on large-scale data necessitates the development of efficient algorithms that can exploit modern architectures, which are fundamentally parallel in nature. However, due to their irregular and inherently sequential nature, many of the current algorithms for community detection are challenging to parallelize. In response to the HIVE Graph Challenge, we present several parallelization heuristics for fast community detection using the Louvain method as the serial template. We implement all the heuristics in a software library called Grappolo. Using the inputs from the HIVE Challenge, we demonstrate superior performance and high-quality solutions based on four parallelization heuristics. We use Grappolo on static graphs as the first step towards community detection on streaming graphs.
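For concreteness, here is a single-threaded Python rendering of the Louvain local-move phase that Grappolo parallelizes: each vertex greedily joins the neighboring community with the largest modularity gain. This is the textbook serial template only, not Grappolo's code.

```python
# Louvain local-move phase: vertices repeatedly switch to the neighboring
# community with the highest modularity gain until no move improves.
from collections import defaultdict

def louvain_local_move(adj, m):
    """adj: {u: {v: weight}} (undirected), m: total edge weight.
    Returns a u -> community labeling after one local-move phase."""
    comm = {u: u for u in adj}
    degree = {u: sum(adj[u].values()) for u in adj}
    comm_deg = defaultdict(float)
    for u in adj:
        comm_deg[comm[u]] += degree[u]
    improved = True
    while improved:
        improved = False
        for u in adj:
            comm_deg[comm[u]] -= degree[u]   # take u out of its community
            links = defaultdict(float)       # weight from u to each community
            for v, w in adj[u].items():
                links[comm[v]] += w
            # Modularity gain of joining community c (common 1/2m factor dropped).
            def gain(c):
                return links[c] - comm_deg[c] * degree[u] / (2.0 * m)
            best = max(links, key=gain, default=comm[u])
            if gain(best) > gain(comm[u]):
                comm[u] = best
                improved = True
            comm_deg[comm[u]] += degree[u]   # put u back
    return comm

triangle = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0}}
print(louvain_local_move(triangle, m=3.0))  # all three vertices merge
```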
Citations: 32
OSCAR: Optimizing SCrAtchpad reuse for graph processing
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091070
Shreyas G. Singapura, Ajitesh Srivastava, R. Kannan, V. Prasanna
Recently, architectures with scratchpad memory are gaining popularity. These architectures combine low-bandwidth, large-capacity DRAM with a high-bandwidth, user-addressable, small-capacity scratchpad. Existing algorithms must be redesigned to take advantage of the high bandwidth while overcoming the scratchpad's capacity constraint. In this paper, we propose an optimized edge-centric graph processing algorithm for scratchpad-based architectures. Our key contribution is a significant reduction in (slower) DRAM accesses through intelligent reuse of scratchpad data. We trade off the reduction in DRAM accesses for slightly more scratchpad accesses. However, due to the much higher bandwidth of the scratchpad, the total memory access cost (DRAM + scratchpad) is significantly reduced. We validate our analysis with experiments on real-world graphs using a simulator that mimics the scratchpad-based architecture, running Single Source Shortest Path (SSSP) and Breadth First Search (BFS). Our experimental results demonstrate a 1.7× to 2.7× reduction in DRAM accesses, leading to an improvement of 1.4× to 2× in total memory (DRAM + scratchpad) accesses.
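A back-of-the-envelope Python model conveys the trade-off: vertex state is processed in blocks sized for the scratchpad, in-block updates are counted as cheap scratchpad accesses, and block loads/stores are counted as expensive DRAM traffic. The blocking scheme and counters are simplifying assumptions for illustration, not the OSCAR algorithm itself.

```python
# Blocked edge-centric BFS (unit weights, directed edges): distances are
# relaxed block by block so every in-block update hits the "scratchpad",
# while block loads/stores and edge streaming go through "DRAM".
def bfs_blocked(edges, n, src, block):
    dist = [float("inf")] * n            # vertex state lives in "DRAM"
    dist[src] = 0
    dram = pad = 0
    changed = True
    while changed:
        changed = False
        for lo in range(0, n, block):    # bring one block into "scratchpad"
            hi = min(lo + block, n)
            tile = dist[lo:hi]
            dram += hi - lo              # block load
            for u, v in edges:           # stream edges
                if lo <= v < hi:
                    pad += 1             # update hits the scratchpad
                    if dist[u] + 1 < tile[v - lo]:
                        tile[v - lo] = dist[u] + 1
                        changed = True
            dist[lo:hi] = tile
            dram += hi - lo              # block write-back
    return dist, dram, pad

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(bfs_blocked(edges, n=4, src=0, block=2))  # distances [0, 1, 2, 1]
```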
Citations: 9
Algorithm and hardware co-optimized solution for large SpMV problems
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091096
Fazle Sadi, L. Pileggi, F. Franchetti
Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel for many scientific and engineering applications. However, SpMV performance and efficiency are poor on commercial off-the-shelf (COTS) architectures, especially when the data size exceeds on-chip memory or the last-level cache (LLC). In this work we present an algorithm co-optimized hardware accelerator for large SpMV problems. We start by exploring the basic differences in data transfer characteristics across various SpMV algorithms. We propose an algorithm that requires the least amount of data transfer while ensuring main memory streaming for all accesses. However, the proposed algorithm requires an efficient multi-way merge, which is difficult to achieve with COTS architectures. Hence, we propose a hardware accelerator model that includes an Application Specific Integrated Circuit (ASIC) for the multi-way merge operation. The proposed accelerator incorporates state-of-the-art 3D-stacked High Bandwidth Memory (HBM) in order to demonstrate the proposed algorithm's capability coupled with the latest technologies. Simulation results using standard benchmarks show improvements of over 100× against COTS architectures with commercial libraries in both energy efficiency and performance.
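A software analogue makes the multi-way merge concrete: each matrix tile emits (row, partial_sum) pairs sorted by row, and a k-way merge accumulates them into the output vector. In this hedged Python sketch a heap plays the role the paper assigns to the ASIC merge unit; the tiling and names are assumed for illustration.

```python
# k-way merge of sorted partial-product streams into the SpMV result.
import heapq
from itertools import groupby

def spmv_multiway_merge(streams):
    """streams: iterables of (row, value) pairs, each sorted by row."""
    y = {}
    merged = heapq.merge(*streams)               # heap-based k-way merge
    for row, group in groupby(merged, key=lambda p: p[0]):
        y[row] = sum(v for _, v in group)        # accumulate per output row
    return y

# Two tiles of a sparse matrix contribute partial sums for overlapping rows.
s1 = [(0, 1.5), (2, 0.5)]
s2 = [(0, 2.0), (1, 4.0), (2, 1.0)]
print(spmv_multiway_merge([s1, s2]))  # {0: 3.5, 1: 4.0, 2: 1.5}
```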
Citations: 9
Distributed workflows for modeling experimental data
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091071
V. Lynch, Jose Borreguero Calvo, E. Deelman, Rafael Ferreira da Silva, Monojoy Goswami, Yawei Hui, E. Lingerfelt, J. Vetter
Modeling helps explain the fundamental physics hidden behind experimental data. In the case of material modeling, running one simulation rarely produces output that reproduces the experimental data. Often one or more of the force field parameters are not precisely known and must be optimized for the output to match that of the experiment. Since the simulations require high-performance computing (HPC) resources and there are usually many simulations to run, a workflow is very useful to prevent errors and ensure that the simulations are identical except for the parameters that need to be varied. The use of HPC implies distributed workflows, but the optimization and the steps to compare the simulation results with experimental data are done on a local workstation. We will present results from force field refinement of data collected at the Spallation Neutron Source using Kepler, Pegasus, and BEAM workflows, and discuss what we have learned from using these workflows.
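The refinement loop the abstract describes can be sketched in a few lines, assuming scipy is available: vary a force-field parameter, run a simulation, and minimize the mismatch with the experimental curve. Here simulate() and the target curve are mock stand-ins for the HPC jobs that a Kepler, Pegasus, or BEAM workflow would actually dispatch.

```python
# Mock parameter-refinement loop: minimize the squared mismatch between
# a simulated observable and a "measured" curve over one free parameter.
import numpy as np
from scipy.optimize import minimize_scalar

q = np.linspace(0.1, 5.0, 50)
experiment = np.exp(-0.8 * q)            # stand-in for a measured curve

def simulate(eps):
    """Mock of an expensive HPC simulation with one free parameter."""
    return np.exp(-eps * q)

def mismatch(eps):
    return float(np.sum((simulate(eps) - experiment) ** 2))

best = minimize_scalar(mismatch, bounds=(0.1, 2.0), method="bounded")
print(f"refined parameter: {best.x:.3f}")  # converges near 0.8
```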
Citations: 1
Parallel k-truss decomposition on multicore systems
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091052
H. Kabir, Kamesh Madduri
We discuss our submission to the HPEC 2017 Static Graph Challenge on k-truss decomposition and triangle counting. Our results use an algorithm called PKT (Parallel k-truss) designed for multicore systems. We are able to process almost all Graph Challenge datasets in under a minute on a 24-core server with 128 GB memory. For a synthetic Graph500 graph with 17 million vertices and 523 million edges, triangle counting takes 16 seconds and truss decomposition takes 29 minutes on the 24-core server.
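For reference, the serial kernel that PKT parallelizes fits in a few lines of Python: for every edge (u, v), count the common neighbors of u and v. This is the textbook baseline only, not the paper's multicore implementation; k-truss decomposition then repeatedly peels edges whose triangle support falls below k - 2.

```python
# Count each triangle exactly once by enumerating edges u < v and
# common neighbors w > v.
def triangle_count(adj):
    """adj: {u: set(neighbors)}, undirected graph."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:                                   # each edge once
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(triangle_count(k4))  # K4 contains 4 triangles
```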
Citations: 33
Performance challenges for heterogeneous distributed tensor decompositions
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091023
Thomas B. Rolinger, T. Simon, Christopher D. Krieger
Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed-memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.
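To fix ideas, here is a tiny dense CP-ALS in numpy showing the alternating least-squares updates whose bottleneck, the MTTKRP, DFacTo re-expresses as a series of SpMVs for sparse tensors. This is a pedagogical dense sketch under simplifying assumptions (3-way tensor, fixed iteration count, no normalization or convergence check), not ReFacTo's distributed code.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product: (I*J) x R."""
    return np.einsum("ir,jr->ijr", A, B).reshape(-1, A.shape[1])

def cp_als(X, R, iters=50):
    I, J, K = X.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
    for _ in range(iters):
        # Each factor update is a matricized-tensor-times-Khatri-Rao
        # product (MTTKRP) followed by a small R x R pseudo-inverse.
        A = X.reshape(I, -1) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.moveaxis(X, 1, 0).reshape(J, -1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.moveaxis(X, 2, 0).reshape(K, -1) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((d, 2)) for d in (5, 6, 7))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)          # exact rank-2 tensor
A, B, C = cp_als(X, R=2)
print(np.linalg.norm(X - np.einsum("ir,jr,kr->ijk", A, B, C)))  # small residual
```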
Citations: 3
Integrating productivity-oriented programming languages with high-performance data structures
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091068
Rohit Varkey Thankachan, Eric R. Hein, B. Swenson, James P. Fairbanks
This paper shows that Julia provides sufficient performance to bridge the gap between productivity-oriented languages and low-level languages for complex, memory-intensive computation tasks such as graph traversal. We provide performance guidelines for using complex low-level data structures in high-productivity languages and present the first parallel integration on the productivity-oriented language side for graph analysis. Performance on the Graph500 benchmark demonstrates that the Julia implementation is competitive with the native C/OpenMP implementation.
Citations: 3
Out of memory SVD solver for big data
Pub Date: 2017-09-01 DOI: 10.1109/HPEC.2017.8091029
A. Haidar, K. Kabir, Diana Fayad, S. Tomov, J. Dongarra
Many applications — from data compression to numerical weather prediction and information retrieval — need to compute large dense singular value decompositions (SVD). When the problems are too large to fit into the computer's main memory, specialized out-of-core algorithms that use disk storage are required. A typical example is trying to analyze a large data set with tools like MATLAB or Octave when the data is simply too large to load. To overcome this, we designed a class of out-of-memory (OOM) algorithms to reduce, as well as overlap, communication with computation. Of particular interest are OOM algorithms for matrices of size m × n, where m >> n or m << n, e.g., corresponding to cases with too many variables or too many observations. To design OOM SVDs, we first study the communication cost of the SVD techniques, as well as of the QR/LQ factorization followed by SVD. We present a theoretical analysis of the data movement cost and strategies for designing OOM SVD algorithms. We show performance results for a multicore architecture that illustrate our theoretical findings and match our performance models. Moreover, our experimental results show the feasibility and superiority of the OOM SVD.
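For the m >> n case, the QR-followed-by-SVD route mentioned in the abstract can be sketched directly: factor A = QR once, take the SVD of the small n × n factor R, and fold Q back into the left singular vectors. The snippet shows the in-memory numerical idea only; the paper's OOM solver additionally tiles A so panels stream from disk.

```python
import numpy as np

def tall_skinny_svd(A):
    """SVD of A (m x n, m >> n) via thin QR followed by SVD of R."""
    Q, R = np.linalg.qr(A)          # Q: m x n, R: n x n
    Ur, s, Vt = np.linalg.svd(R)    # cheap: operates on n x n only
    return Q @ Ur, s, Vt            # A = (Q Ur) diag(s) Vt

A = np.random.default_rng(0).standard_normal((20_000, 50))
U, s, Vt = tall_skinny_svd(A)
print(np.allclose(A, (U * s) @ Vt))  # True: A = U diag(s) V^T
```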
Citations: 13