
Latest publications: 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)

27th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020) Technical program
{"title":"27th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2020) Technical program","authors":"","doi":"10.1109/hipc50609.2020.00013","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00013","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114951171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Understanding HPC Application I/O Behavior Using System Level Statistics
A. Paul, Olaf Faaland, A. Moody, Elsa Gonsiorowski, K. Mohror, A. Butt
The processor performance of high performance computing (HPC) systems is increasing at a much higher rate than storage performance. This imbalance leads to I/O performance bottlenecks in massively parallel HPC applications. Therefore, there is a need for improvements in storage and file system designs to meet the ever-growing I/O needs of HPC applications. Storage and file system designers require a deep understanding of how HPC application I/O behavior affects current storage system installations in order to improve them. In this work, we contribute to this understanding using application-agnostic file system statistics gathered on compute nodes as well as metadata and object storage file system servers. We analyze file system statistics of more than 4 million jobs over a period of three years on two systems at Lawrence Livermore National Laboratory that include a 15 PiB Lustre file system for storage. The results of our study add to the state-of-the-art in I/O understanding by providing insight into how general HPC workloads affect the performance of large-scale storage systems. Some key observations in our study show that reads and writes are evenly distributed across the storage system; applications that perform I/O spread it across ∼78% of the minutes of their runtime on average; fewer than 22% of HPC users who submit write-intensive jobs perform efficient writes to the file system; and I/O contention seriously impacts I/O performance.
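As a rough illustration of the runtime-coverage metric mentioned above, the sketch below (plain Python over hypothetical per-minute byte counters, not the authors' Lustre-based tooling) computes the fraction of a job's runtime minutes that contain any I/O activity.

```python
# A toy sketch (assumed per-minute counters; not the authors' tooling) of the
# I/O runtime-coverage metric: the fraction of a job's runtime minutes that
# contain any read or write activity.
from typing import Dict

def io_active_fraction(bytes_per_minute: Dict[int, int]) -> float:
    """Map of runtime-minute index -> bytes read+written in that minute."""
    if not bytes_per_minute:
        return 0.0
    active = sum(1 for b in bytes_per_minute.values() if b > 0)
    return active / len(bytes_per_minute)

# Example: a 10-minute job with I/O in 8 of its minutes -> 0.8, in the same
# spirit as the ~78% average coverage reported in the abstract.
job = {minute: (0 if minute in (3, 7) else 4096) for minute in range(10)}
print(io_active_fraction(job))  # 0.8
```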
Citations: 14
GPU-FPtuner: Mixed-precision Auto-tuning for Floating-point Applications on GPU
Ruidong Gu, M. Becchi
GPUs have been extensively used to accelerate scientific applications from a variety of domains: computational fluid dynamics, astronomy and astrophysics, climate modeling, numerical analysis, to name a few. Many of these applications rely on floating-point arithmetic, which is approximate in nature. High-precision libraries have been proposed to mitigate accuracy issues due to the use of floating-point arithmetic. However, these libraries offer increased accuracy at a significant performance cost. Previous work, primarily focusing on CPU code and on standard IEEE floating-point data types, has explored mixed precision as a compromise between performance and accuracy. In this work, we propose a mixed precision autotuner for GPU applications that rely on floating-point arithmetic. Our tool supports standard 32- and 64-bit floating-point arithmetic, as well as high precision through the QD library. Our autotuner relies on compiler analysis to reduce the size of the tuning space. In particular, our tuning strategy takes into account code patterns prone to error propagation and GPU-specific considerations to generate a tuning plan that balances performance and accuracy. Our autotuner pipeline, implemented using the ROSE compiler and Python scripts, is fully automated and the code is available in open source. Our experimental results collected on benchmark applications with various code complexities show performance-accuracy tradeoffs for these applications and the effectiveness of our tool in identifying representative tuning points.
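The sketch below is a minimal, CPU-only analogue of the tuning loop described above; it is not GPU-FPtuner (which uses ROSE-based compiler analysis and the QD library), and the kernel and error budget are assumptions for illustration. It searches a tiny precision space, measures runtime, and accepts a lower precision only if the result stays within the error budget.

```python
# A minimal, CPU-only analogue (hypothetical kernel and tolerance; not the
# GPU-FPtuner pipeline) of mixed-precision tuning: accept a lower precision
# only if the result stays within an error budget, then keep the fastest one.
import time
import numpy as np

def kernel(a, b, dtype):
    # Toy floating-point workload standing in for an application kernel.
    return (a.astype(dtype) * b.astype(dtype)).sum(dtype=dtype)

rng = np.random.default_rng(0)
a, b = rng.random(1_000_000), rng.random(1_000_000)
reference = kernel(a, b, np.float64)       # high-precision reference result
tolerance = 1e-6                           # assumed relative-error budget

best_dtype, best_time = np.float64, float("inf")
for dtype in (np.float32, np.float64):     # the (tiny) tuning space
    start = time.perf_counter()
    result = kernel(a, b, dtype)
    elapsed = time.perf_counter() - start
    rel_err = abs(float(result) - float(reference)) / abs(float(reference))
    if rel_err <= tolerance and elapsed < best_time:
        best_dtype, best_time = dtype, elapsed
print("chosen precision:", np.dtype(best_dtype).name)
```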
Citations: 3
On the Marriage of Asynchronous Many Task Runtimes and Big Data: A Glance
Joshua D. Suetterlein, J. Manzano, A. Márquez, G. Gao
The rise of accelerator-based architectures and reconfigurable computing has showcased the weakness of software stack toolchains that still maintain a static view of the hardware instead of relying on a symbiotic relationship between static tools (e.g., compilers) and dynamic tools (e.g., runtimes). In the past decades, this need has given rise to adaptive runtimes with increasingly finer computational tasks. These finer tasks help to take advantage of the hardware by switching out when a long-latency operation is encountered (because of deeper memory hierarchies and new memory technologies that might target streaming instead of random access), thus trading idle time for unrelated work. Examples of these finer-task runtimes are Asynchronous Many Task (AMT) runtimes, in which highly efficient computational graphs run on a variety of hardware. Due to their inherent latency-tolerant characteristics, these runtimes can be used effectively by latency-sensitive applications such as Graph Analytics and Big Data. This paper aims to present an example of how the careful design of an AMT can exploit the hardware substrate when faced with high-latency applications such as those found in the Big Data domain. Moreover, we aim to show the power of these runtimes' introspection and adaptive capabilities when facing the changing requirements of application workloads. We use the Performance Open Community Runtime (P-OCR) as our vehicle to demonstrate the concepts presented here.
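The overlap described above can be illustrated with a plain asyncio analogy; this is only a conceptual stand-in, not P-OCR or any AMT runtime, and the "remote block fetch" is a hypothetical long-latency operation.

```python
# A conceptual analogy in plain asyncio (not P-OCR or any AMT runtime): each
# fine-grained task yields at its long-latency operation, so the scheduler
# runs unrelated tasks instead of idling, hiding the latency.
import asyncio
import random

async def fetch_remote_block(task_id: int) -> int:
    # Stand-in for a long-latency operation (remote or slow-tier memory access).
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return task_id * task_id

async def fine_grained_task(task_id: int) -> int:
    data = await fetch_remote_block(task_id)  # task yields here; others proceed
    return data + 1                           # short compute once the data arrives

async def main():
    # Many small tasks in flight at once; one task's latency is covered by the rest.
    results = await asyncio.gather(*(fine_grained_task(i) for i in range(32)))
    print(sum(results))

asyncio.run(main())
```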
Citations: 1
Pipelined Preconditioned Conjugate Gradient Methods for Distributed Memory Systems
Manas Tiwari, Sathish S. Vadhiyar
The Preconditioned Conjugate Gradient (PCG) method is one of the most widely used methods for solving sparse linear systems of equations. Pipelined PCG (PIPECG) attempts to eliminate the dependencies in the computations of the PCG algorithm and to overlap non-dependent computations by reorganizing the traditional PCG code and using non-blocking allreduces. We have developed a novel pipelined PCG algorithm called PIPECG-OATI (One Allreduce per Two Iterations) that provides large overlap of global communication and computation at higher core counts in distributed-memory CPU systems. Our method achieves this overlap by using iteration combination and by introducing new non-recurrence computations. We compare our method with other pipelined CG methods on a variety of problems and demonstrate that our method always gives the lowest runtimes. Our method gives up to 3x speedup over the PCG method and 1.73x speedup over the PIPECG method at large core counts.
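For reference, the sketch below shows classical Jacobi-preconditioned PCG in NumPy, with comments marking the per-iteration dot products that become blocking allreduces on distributed memory; these are the dependencies that pipelined variants such as PIPECG and PIPECG-OATI reorganize so reductions overlap with local computation. It is a single-process illustration, not the authors' MPI implementation.

```python
# Classical PCG with a Jacobi preconditioner. The annotated dot products are
# the global reductions that become blocking allreduces in a distributed run.
import numpy as np

def pcg(A, b, tol=1e-8, max_iter=1000):
    M_inv = 1.0 / np.diag(A)              # Jacobi preconditioner
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z                             # reduction #1 (allreduce in parallel PCG)
    for _ in range(max_iter):
        Ap = A @ p                         # sparse matvec (neighbor communication)
        alpha = rz / (p @ Ap)              # reduction #2; depends on Ap, so it
                                           # cannot start before the matvec finishes
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:        # a further reduction for the norm check
            break
        z = M_inv * r
        rz_new = r @ z                     # reduction #1 of the next iteration
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(A, b))                           # approx [0.0909, 0.6364]
```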
Citations: 0
[Title page]
{"title":"[Title page]","authors":"","doi":"10.1109/hipc50609.2020.00002","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00002","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131971972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications
A. Shafi, J. Hashmi, H. Subramoni, D. Panda
Python is emerging as a popular language in the data science community due to its ease of use, vibrant community, and rich set of libraries. Dask is a popular Python-based distributed computing framework that allows users to process large amounts of data on parallel hardware. The Dask distributed package is a non-blocking, asynchronous, and concurrent library that offers support for distributed execution of tasks on datacenter and HPC environments. A key requirement for designing high-performance communication backends for Dask distributed is to provide scalable support for coroutines, which, unlike regular Python functions, can only be invoked from asynchronous applications. In this paper, we present Blink, a high-performance communication library for Dask on high-performance RDMA networks like InfiniBand. Blink offers a multi-layered architecture that matches the communication requirements of Dask and exploits high-performance interconnects using a Cython wrapper layer to the C backend. We evaluate the performance of Blink against other counterparts using various micro-benchmarks and application kernels on three different cluster testbeds with varying interconnect speeds. Our micro-benchmark evaluation reveals that Blink outperforms other communication backends by more than 3× for message sizes ranging from 1 Byte to 64 KByte, and by a factor of 2× for message sizes ranging from 128 KByte to 8 MByte. Using various application-level evaluations, we demonstrate that Dask achieves up to 7% improvement in application throughput (e.g., total worker throughput).
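The coroutine-based communication pattern that Dask distributed expects from a backend can be sketched with plain asyncio over loopback TCP; this is a stand-in for illustration only, not Blink's RDMA/Cython path.

```python
# A loopback sketch in plain asyncio over TCP (a stand-in, not Blink's
# RDMA backend or its Cython layer): the communication routines are
# coroutines, so asynchronous application code can await sends and receives.
import asyncio

async def handle(reader, writer):
    data = await reader.read(1024)        # recv coroutine: yields until data arrives
    writer.write(data.upper())            # echo the payload back, transformed
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)  # ephemeral port
    port = server.sockets[0].getsockname()[1]
    async with server:
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"dask message")     # send path: buffered, non-blocking write
        await writer.drain()              # awaitable flush; the event loop keeps running
        print(await reader.read(1024))    # b'DASK MESSAGE'
        writer.close()

asyncio.run(main())
```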
Citations: 1
Parallel Hierarchical Clustering using Rank-Two Nonnegative Matrix Factorization
Lawton Manning, Grey Ballard, R. Kannan, Haesun Park
Nonnegative Matrix Factorization (NMF) is an effective tool for clustering nonnegative data, either for computing a flat partitioning of a dataset or for determining a hierarchy of similarity. In this paper, we propose a parallel algorithm for hierarchical clustering that uses a divide-and-conquer approach based on rank-two NMF to split a data set into two cohesive parts. Not only does this approach uncover more structure in the data than a flat NMF clustering, but also rank-two NMF can be computed more quickly than for general ranks, providing comparable overall time to solution. Our data distribution and parallelization strategies are designed to maintain computational load balance throughout the data-dependent hierarchy of computation while limiting interprocess communication, allowing the algorithm to scale to large dense and sparse data sets. We demonstrate the scalability of our parallel algorithm in terms of data size (up to 800 GB) and number of processors (up to 80 nodes of the Summit supercomputer), applying the hierarchical clustering approach to hyperspectral imaging and image classification data. Our algorithm for Rank-2 NMF scales perfectly on up to 1000s of cores and the entire hierarchical clustering method achieves 5.9x speedup scaling from 10 to 80 nodes on the 800 GB dataset.
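A minimal serial sketch of the divide-and-conquer idea follows, using scikit-learn's NMF with two components in place of the authors' distributed MPI implementation; the depth and size limits are assumed stopping criteria for illustration.

```python
# A serial sketch (scikit-learn rank-2 NMF; not the authors' distributed MPI
# code) of divide-and-conquer hierarchical clustering: split rows by their
# dominant factor, then recurse on each part.
import numpy as np
from sklearn.decomposition import NMF

def hier_cluster(X, indices=None, depth=0, max_depth=3, min_size=4):
    if indices is None:
        indices = np.arange(X.shape[0])
    if depth == max_depth or len(indices) < min_size:
        return {"leaf": indices}
    W = NMF(n_components=2, init="nndsvda", max_iter=500).fit_transform(X[indices])
    left = indices[W[:, 0] >= W[:, 1]]     # rows where factor 0 dominates
    right = indices[W[:, 0] < W[:, 1]]     # rows where factor 1 dominates
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop recursing
        return {"leaf": indices}
    return {"left": hier_cluster(X, left, depth + 1, max_depth, min_size),
            "right": hier_cluster(X, right, depth + 1, max_depth, min_size)}

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(200, 30)))     # toy nonnegative data
tree = hier_cluster(X)
print(tree.keys())
```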
Citations: 3
Message from the General Co-Chairs
A. Luque, Yousef Ibrahim, J. J. Rodríguez
The main conference spans three days, from January 8 through January 10, and is flanked by two days of workshops before and after the main conference days. The first conference day will begin with two keynote speeches from two research leaders from academia: Carla-Fabiana Chiasserini from Politecnico di Torino, Italy, and Gerhard P. Fettweis from TU-Dresden, Germany. The second day, January 9, will start with a newly introduced fireside chat with the two COMSNETS lifetime achievement awardees! On the second day, we will also have a distinguished banquet speaker in the evening: Rahul Mangharam, from the University of Pennsylvania, USA. On the third day of the main conference, we will have two distinguished keynote speakers from industry: Sriram Rajamani from Microsoft Research, India, and Saravanan Radhakrishnan from CISCO, India.
Citations: 0
HiPC 2020 Technical Program Committee
{"title":"HiPC 2020 Technical Program Committee","authors":"","doi":"10.1109/hipc50609.2020.00008","DOIUrl":"https://doi.org/10.1109/hipc50609.2020.00008","url":null,"abstract":"","PeriodicalId":375004,"journal":{"name":"2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123428691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0