Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication
Nicholas Contini, B. Ramesh, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, M. Abduljabbar, H. Subramoni, D. Panda
DOI: 10.1145/3577193.3593720
Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC workloads, as shown by the adoption of heterogeneous system designs leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging.
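To make the staging overhead concrete, the sketch below contrasts application-level staging (an explicit device-to-host copy followed by MPI_Send) with handing the device buffer directly to an accelerator-aware MPI library, which can hide the PCIe transfer inside the runtime. The function fpga_read_buffer() stands in for a vendor buffer-read call and is hypothetical; only the MPI calls are real API.

```c
/* Hedged sketch: application-level staging vs. handing a device buffer
 * straight to an FPGA-aware MPI library.  fpga_read_buffer() is a
 * hypothetical placeholder for a vendor API; it is not part of MPI. */
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical vendor call: copy 'bytes' from an FPGA buffer to host memory. */
void fpga_read_buffer(void *fpga_buf, void *host_buf, size_t bytes);

/* Baseline: explicit staging through a host bounce buffer. */
void send_staged(void *fpga_buf, size_t bytes, int dest, MPI_Comm comm)
{
    void *host_buf = malloc(bytes);
    fpga_read_buffer(fpga_buf, host_buf, bytes);              /* device -> host */
    MPI_Send(host_buf, (int)bytes, MPI_BYTE, dest, 0, comm);  /* host -> remote */
    free(host_buf);
}

/* Accelerator-aware path: the runtime detects that 'fpga_buf' is device
 * memory and performs the PCIe staging internally (possibly pipelined),
 * so the application issues a single MPI call on the device buffer. */
void send_fpga_aware(void *fpga_buf, size_t bytes, int dest, MPI_Comm comm)
{
    MPI_Send(fpga_buf, (int)bytes, MPI_BYTE, dest, 0, comm);
}
```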
{"title":"Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication","authors":"Nicholas Contini, B. Ramesh, Kaushik Kandadi Suresh, Tu Tran, Benjamin Michalowicz, M. Abduljabbar, H. Subramoni, D. Panda","doi":"10.1145/3577193.3593720","DOIUrl":"https://doi.org/10.1145/3577193.3593720","url":null,"abstract":"Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC loads, as shown by the adoption of heterogeneous system design leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115429158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Unified Implementation of GEMM in BLIS
RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn
DOI: 10.1145/3577193.3593707
Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing - an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance - with the first computational "pass" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler since it obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. A prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework is described and performance on a range of architectures is reported.
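The fusion idea can be illustrated with a simplified loop nest: the first pass over a block of A both performs its share of the multiplication and writes the packed, micro-panel-ordered copy, so later passes read contiguous memory without a separate packing sweep. This is a minimal sketch of the concept, not BLIS's actual (blocked, vectorized) implementation; MR, NR, and gemm_fused_pack are illustrative names.

```c
/* Hedged sketch of fusing packing with the first computational pass.
 * A: m x k (column-major, leading dim lda), B: k x n, C: m x n (pre-initialized).
 * Apack must hold ((m+MR-1)/MR)*MR*k doubles; element (i, p) maps to
 * Apack[(i/MR)*(MR*k) + p*MR + (i%MR)]  (MR-row micro-panels). */
#include <stddef.h>
#define MR 4
#define NR 4

static void gemm_fused_pack(int m, int n, int k,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc,
                            double *Apack)
{
    for (int jr = 0; jr < n; jr += NR) {          /* panels of B / C columns   */
        int nr = (n - jr < NR) ? n - jr : NR;
        for (int p = 0; p < k; ++p) {
            for (int i = 0; i < m; ++i) {
                size_t idx = (size_t)(i / MR) * MR * k + (size_t)p * MR + i % MR;
                double a;
                if (jr == 0) {                    /* first pass: read original  */
                    a = A[i + (size_t)p * lda];   /* layout and pack on the fly */
                    Apack[idx] = a;
                } else {                          /* later passes: packed copy  */
                    a = Apack[idx];
                }
                for (int j = jr; j < jr + nr; ++j)
                    C[i + (size_t)j * ldc] += a * B[p + (size_t)j * ldb];
            }
        }
    }
}
```

For small problems the separate packing sweep is a dominant overhead, which is why folding it into the first pass helps across the size spectrum rather than only for large matrices.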
{"title":"Towards a Unified Implementation of GEMM in BLIS","authors":"RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn","doi":"10.1145/3577193.3593707","DOIUrl":"https://doi.org/10.1145/3577193.3593707","url":null,"abstract":"Matrix libraries often focus on achieving high performance for problems considered to be either \"small\" or \"large\", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing - an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance - with the first computational \"pass\" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler since it obviates the need to carefully express and parameterize logic that chooses between a \"small matrix\" strategy and a \"large matrix\" strategy. A prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework is described and performance on a range of architectures is reported.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123356760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable parallelization for the solution of phonon Boltzmann Transport Equation
H. Tran, Siddharth Saurav, P. Sadayappan, S. Mazumder, H. Sundar
DOI: 10.1145/3577193.3593723
The Boltzmann Transport Equation (BTE) for phonons is often used to predict thermal transport at submicron scales in semiconductors. The BTE is a seven-dimensional nonlinear integro-differential equation, making it difficult to solve even after linearization under the single relaxation time approximation. Furthermore, parallelization and load balancing are challenging, given the high dimensionality and variability of the linear systems' conditioning. This work presents a 'synthetic' scalable parallelization method for solving the BTE on large-scale systems. The method includes cell-based parallelization, combined band+cell-based parallelization, and a batching technique. The essential computational ingredient of cell-based parallelization is a sparse matrix-vector product (SpMV) that can be integrated with an existing linear algebra library like PETSc. The combined approach enhances the cell-based method by further parallelizing the band dimension to take advantage of low inter-band communication costs. For the batched approach, we developed a batched SpMV that enables multiple linear systems to be solved simultaneously, merging many MPI messages to reduce communication costs and thus maintaining scalability when the grain size becomes very small. We present numerical experiments to demonstrate our method's excellent speedups and scalability up to 16,384 cores for a problem with 12.6 billion unknowns.
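The message-merging idea behind the batched approach can be sketched as follows: the local SpMV is applied to every system in the batch, and the halo values that all systems must send to a given neighbor are packed into one buffer, so a single MPI message replaces one message per system. The CSR type and function names below are illustrative, not the paper's implementation (which builds on PETSc).

```c
/* Hedged sketch of batched SpMV with merged halo messages. */
#include <mpi.h>
#include <stdlib.h>

typedef struct { int n; const int *rowptr, *col; const double *val; } Csr;

/* y[b] = A[b] * x[b] for b = 0..nb-1 (local rows only). */
void spmv_batched(const Csr *A, double **x, double **y, int nb)
{
    for (int b = 0; b < nb; ++b)
        for (int i = 0; i < A[b].n; ++i) {
            double s = 0.0;
            for (int jj = A[b].rowptr[i]; jj < A[b].rowptr[i + 1]; ++jj)
                s += A[b].val[jj] * x[b][A[b].col[jj]];
            y[b][i] = s;
        }
}

/* One merged message per neighbor instead of nb separate ones:
 * send_idx lists the ns local entries this neighbor needs, and the
 * caller provides recv_buf sized nb*nr for the incoming ghost values. */
void exchange_halo_batched(int neighbor, const int *send_idx, int ns,
                           double **x, double *recv_buf, int nr,
                           int nb, MPI_Comm comm)
{
    double *send_buf = malloc((size_t)nb * ns * sizeof *send_buf);
    for (int b = 0; b < nb; ++b)                 /* pack all systems together */
        for (int s = 0; s < ns; ++s)
            send_buf[(size_t)b * ns + s] = x[b][send_idx[s]];
    MPI_Sendrecv(send_buf, nb * ns, MPI_DOUBLE, neighbor, 0,
                 recv_buf, nb * nr, MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
    free(send_buf);
}
```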
{"title":"Scalable parallelization for the solution of phonon Boltzmann Transport Equation","authors":"H. Tran, Siddharth Saurav, P. Sadayappan, S. Mazumder, H. Sundar","doi":"10.1145/3577193.3593723","DOIUrl":"https://doi.org/10.1145/3577193.3593723","url":null,"abstract":"The Boltzmann Transport Equation (BTE) for phonons is often used to predict thermal transport at submicron scales in semiconductors. The BTE is a seven-dimensional nonlinear integro-differential equation, resulting in difficulty in its solution even after linearization under the single relaxation time approximation. Furthermore, parallelization and load balancing are challenging, given the high dimensionality and variability of the linear systems' conditioning. This work presents a 'synthetic' scalable parallelization method for solving the BTE on large-scale systems. The method includes cell-based parallelization, combined band+cell-based parallelization, and batching technique. The essential computational ingredient of cell-based parallelization is a sparse matrix-vector product (SpMV) that can be integrated with an existing linear algebra library like PETSc. The combined approach enhances the cell-based method by further parallelizing the band dimension to take advantage of low inter-band communication costs. For the batched approach, we developed a batched SpMV that enables multiple linear systems to be solved simultaneously, merging many MPI messages to reduce communication costs, thus maintaining scalability when the grain size becomes very small. We present numerical experiments to demonstrate our method's excellent speedups and scalability up to 16384 cores for a problem with 12.6 billion unknowns.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125285839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data
Jinyang Liu, S. Di, Kai Zhao, Xin Liang, Zizhong Chen, F. Cappello
DOI: 10.1145/3577193.3593721
Error-bounded lossy compression has been effective in resolving the big scientific data issue because it has great potential to significantly reduce the data volume while allowing users to control data distortion based on specified error bounds. However, none of the existing error-bounded lossy compressors can always obtain the best compression quality, because of the diverse characteristics of different datasets. In this paper, we develop FAZ, a flexible and adaptive error-bounded lossy compression framework that exhibits a high capability of adapting to diverse datasets. FAZ can always keep the compression quality at the best level compared with other state-of-the-art compressors across different datasets. We perform a comprehensive evaluation using 6 real-world scientific applications and 6 other state-of-the-art error-bounded lossy compressors. Experiments show that, compared with the other existing lossy compressors, FAZ can improve the compression ratio by up to 120%, 190%, and 75% under the same error bound, the same PSNR, and the same SSIM, respectively.
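One way to picture the adaptive-selection idea: given a set of candidate compression pipelines (hypothetical function pointers here), compress a representative sample under the user's error bound and keep the pipeline that yields the smallest output. FAZ's actual framework is modular and auto-tuned well beyond this, so the code is only a conceptual illustration.

```c
/* Hedged sketch of adaptive pipeline selection; not FAZ's actual logic.
 * Each candidate is a hypothetical routine that compresses a sample under
 * the error bound and returns the compressed size (0 on failure). */
#include <stddef.h>

typedef size_t (*compress_fn)(const float *data, size_t n, double err_bound,
                              unsigned char *out, size_t out_cap);

int pick_best_pipeline(const float *sample, size_t n, double err_bound,
                       compress_fn *candidates, int ncand,
                       unsigned char *scratch, size_t scratch_cap)
{
    int best = -1;
    size_t best_size = (size_t)-1;
    for (int c = 0; c < ncand; ++c) {
        size_t sz = candidates[c](sample, n, err_bound, scratch, scratch_cap);
        if (sz > 0 && sz < best_size) {   /* smaller output == higher ratio */
            best_size = sz;
            best = c;
        }
    }
    return best;   /* index of the pipeline to apply to the full dataset */
}
```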
{"title":"FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data","authors":"Jinyang Liu, S. Di, Kai Zhao, Xin Liang, Zizhong Chen, F. Cappello","doi":"10.1145/3577193.3593721","DOIUrl":"https://doi.org/10.1145/3577193.3593721","url":null,"abstract":"Error-bounded lossy compression has been effective to resolve the big scientific data issue because it has a great potential to significantly reduce the data volume while allowing users to control data distortion based on specified error bounds. However, none of the existing error-bounded lossy compressors can always obtain the best compression quality because of the diverse characteristics of different datasets. In this paper, we develop FAZ, a flexible and adaptive error-bounded lossy compression framework, which projects a fairly high capability of adapting to diverse datasets. FAZ can always keep the compression quality at the best level compared with other state-of-the-art compressors for different datasets. We perform a comprehensive evaluation using 6 real-world scientific applications and 6 other state-of-the-art error-bounded lossy compressors. Experiments show that compared with the other existing lossy compressors, FAZ can improve the compression ratio by up to 120%, 190%, and 75% when setting the same error bound, the same PSNR and the same SSIM, respectively.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116234604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DyVer: Dynamic Version Handling for Array Databases
Amelie Chi Zhou, Zhoubin Ke, Jianming Lao
DOI: 10.1145/3577193.3593734
Array databases are important data management systems for scientific applications. In array databases, version handling is an important problem due to the no-overwrite feature of scientific data. Existing studies on optimizing data versioning in array databases are relatively simple: they focus either on minimizing storage size or on improving simple version chains. In this paper, we focus on two challenges: (1) how to balance the tradeoff between storage size and query time for numerous data versions, which may have derivative relationships with each other; and (2) how to dynamically maintain this balance as new versions are continuously added. To address these challenges, this paper presents DyVer, a versioning framework for SciDB, one of the most well-known array databases. DyVer includes two techniques: an efficient storage layout optimizer that quickly reduces query time under a storage capacity constraint, and a version segment technique to cope with dynamic version additions. We evaluate DyVer using real-world scientific datasets. Results show that DyVer can achieve up to 95% improvement in average query time compared to state-of-the-art data versioning techniques under the same storage capacity constraint.
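The storage-versus-query-time tradeoff can be illustrated with a toy policy: store a new version as a delta against its parent unless the reconstruction chain would grow past a threshold, in which case materialize a full copy. This is not DyVer's layout optimizer, only a sketch of the tension it optimizes; Version and max_chain are illustrative names.

```c
/* Toy illustration of the storage/query-time tradeoff for version chains. */
#include <stdbool.h>

typedef struct Version {
    struct Version *parent;   /* NULL for a materialized (full) version    */
    int chain_len;            /* deltas to replay to reconstruct this one  */
} Version;

/* max_chain trades storage for query time: smaller values cost more space
 * but cap the number of deltas a query must replay. */
bool store_as_delta(const Version *parent, int max_chain)
{
    return parent != NULL && parent->chain_len + 1 <= max_chain;
}

Version make_version(Version *parent, int max_chain)
{
    Version v;
    if (store_as_delta(parent, max_chain)) {
        v.parent = parent;                 /* cheap to store, slower to query */
        v.chain_len = parent->chain_len + 1;
    } else {
        v.parent = NULL;                   /* full snapshot: chain restarts   */
        v.chain_len = 0;
    }
    return v;
}
```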
{"title":"DyVer: Dynamic Version Handling for Array Databases","authors":"Amelie Chi Zhou, Zhoubin Ke, Jianming Lao","doi":"10.1145/3577193.3593734","DOIUrl":"https://doi.org/10.1145/3577193.3593734","url":null,"abstract":"Array databases are important data management systems for scientific applications. In array databases, version handling is an important problem due to the no-overwrite feature of scientific data. Existing studies for optimizing data versioning in array databases are relatively simple, which either focus on minimizing storage sizes or improving simple version chains. In this paper, we focus on two challenges: (1) how to balance the tradeoff between storage size and query time for numerous version data, which may have derivative relationships with each other; (2) how to dynamically maintain this balance with continuously added new versions. To address the above challenges, this paper presents DyVer, a versioning framework for SciDB which is one of the most well-known array databases. DyVer includes two techniques, including an efficient storage layout optimizer to quickly reduce data query time under storage capacity constraint and a version segment technique to cope with dynamic version additions. We evaluate DyVer using real-world scientific datasets. Results show that DyVer can achieve up to 95% improvement on the average query time compared to state-of-the-art data versioning techniques under the same storage capacity constraint.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133138428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed-Memory Parallel JointNMF
Srinivas Eswar, Benjamin Cobb, Koby Hayashi, R. Kannan, Grey Ballard, R. Vuduc, Haesun Park
DOI: 10.1145/3577193.3593733
Joint Nonnegative Matrix Factorization (JointNMF) is a hybrid method for mining information from datasets that contain both feature and connection information. We propose distributed-memory parallelizations of three algorithms for solving the JointNMF problem, based on Alternating Nonnegative Least Squares, Projected Gradient Descent, and Projected Gauss-Newton. We extend well-known communication-avoiding algorithms from the single-processor-grid case to our coupled case on two processor grids. We demonstrate the scalability of the algorithms on up to 960 cores (40 nodes) with 60% parallel efficiency. The more sophisticated Alternating Nonnegative Least Squares (ANLS) and Gauss-Newton variants outperform the first-order gradient descent method in reducing the objective on large-scale problems. We perform a topic modelling task on a large corpus of academic papers, consisting of over 37 million paper abstracts and nearly a billion citation relationships, demonstrating the utility and scalability of the methods.
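For reference, one common formulation of the JointNMF objective couples a feature matrix X (approximated by WH) with a symmetric connection matrix S (approximated by H^T H); the paper's exact objective may add weighting or regularization terms beyond this hedged sketch.

```latex
\min_{W \ge 0,\; H \ge 0}\;
\lVert X - W H \rVert_F^{2}
\;+\; \alpha \,\lVert S - H^{\mathsf{T}} H \rVert_F^{2}
```

All three methods in the paper optimize a coupled objective of this flavor: the ANLS and Gauss-Newton variants update one factor block at a time with the other held fixed, while the projected gradient method takes first-order steps on both factors.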
{"title":"Distributed-Memory Parallel JointNMF","authors":"Srinivas Eswar, Benjamin Cobb, Koby Hayashi, R. Kannan, Grey Ballard, R. Vuduc, Haesun Park","doi":"10.1145/3577193.3593733","DOIUrl":"https://doi.org/10.1145/3577193.3593733","url":null,"abstract":"Joint Nonnegative Matrix Factorization (JointNMF) is a hybrid method for mining information from datasets that contain both feature and connection information. We propose distributed-memory parallelizations of three algorithms for solving the JointNMF problem based on Alternating Nonnegative Least Squares, Projected Gradient Descent, and Projected Gauss-Newton. We extend well-known communication-avoiding algorithms using a single processor grid case to our coupled case on two processor grids. We demonstrate the scalability of the algorithms on up to 960 cores (40 nodes) with 60% parallel efficiency. The more sophisticated Alternating Nonnegative Least Squares (ANLS) and Gauss-Newton variants outperform the first-order gradient descent method in reducing the objective on large-scale problems. We perform a topic modelling task on a large corpus of academic papers that consists of over 37 million paper abstracts and nearly a billion citation relationships, demonstrating the utility and scalability of the methods.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115632655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lightweight Huffman Coding for Efficient GPU Compression
Milan Shah, Xiaodong Yu, S. Di, M. Becchi, F. Cappello
DOI: 10.1145/3577193.3593736
Lossy compression is often deployed in scientific applications to reduce data footprint and improve data transfers and I/O performance. Especially for applications requiring on-the-fly compression, it is essential to minimize compression's runtime. In this paper, we design a scheme to improve the performance of cuSZ, a GPU-based lossy compressor. We observe that Huffman coding - used by cuSZ to compress metadata generated during compression - incurs a performance overhead that can be significant, especially for smaller datasets. Our work seeks to reduce the Huffman coding runtime with minimal-to-no impact on cuSZ's compression efficiency. Our contributions are as follows. First, we examine a variety of probability distributions to determine which distributions closely model the input to cuSZ's Huffman coding stage. From these distributions, we create a dictionary of pre-computed codebooks such that, during compression, a codebook is selected from the dictionary instead of computing a custom codebook. Second, we explore three codebook selection criteria to be applied at runtime. Finally, we evaluate our scheme on real-world datasets and in the context of two important application use cases, HDF5 and MPI, using an NVIDIA A100 GPU. Our evaluation shows that our method can reduce the Huffman coding penalty by a factor of 78--92×, translating to a total speedup of up to 5× over baseline cuSZ. Smaller HDF5 chunk sizes enjoy over an 8× speedup in compression, and MPI messages on the scale of tens of MB see a 1.4--30.5× speedup in communication time.
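One plausible runtime selection criterion (the paper evaluates three) is to estimate, from the histogram of symbols to be encoded, the output size each pre-computed codebook would produce and pick the smallest. The sketch below assumes the dictionary stores per-symbol code lengths; it is illustrative, not cuSZ's code.

```c
/* Hedged sketch of codebook selection by estimated encoded size. */
#include <stdint.h>
#include <stddef.h>

/* hist[s]: count of symbol s in the data to encode.
 * code_len[c][s]: bit length of symbol s in pre-computed codebook c. */
int select_codebook(const uint64_t *hist, size_t nsym,
                    const uint8_t *const *code_len, int ncb)
{
    int best = 0;
    uint64_t best_bits = UINT64_MAX;
    for (int c = 0; c < ncb; ++c) {
        uint64_t bits = 0;
        for (size_t s = 0; s < nsym; ++s)
            bits += hist[s] * code_len[c][s];   /* estimated output size */
        if (bits < best_bits) { best_bits = bits; best = c; }
    }
    return best;   /* index into the pre-computed codebook dictionary */
}
```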
{"title":"Lightweight Huffman Coding for Efficient GPU Compression","authors":"Milan Shah, Xiaodong Yu, S. Di, M. Becchi, F. Cappello","doi":"10.1145/3577193.3593736","DOIUrl":"https://doi.org/10.1145/3577193.3593736","url":null,"abstract":"Lossy compression is often deployed in scientific applications to reduce data footprint and improve data transfers and I/O performance. Especially for applications requiring on-the-flight compression, it is essential to minimize compression's runtime. In this paper, we design a scheme to improve the performance of cuSZ, a GPU-based lossy compressor. We observe that Huffman coding - used by cuSZ to compress metadata generated during compression - incurs a performance overhead that can be significant, especially for smaller datasets. Our work seeks to reduce the Huffman coding runtime with minimal-to-no impact on cuSZ's compression efficiency. Our contributions are as follows. First, we examine a variety of probability distributions to determine which distributions closely model the input to cuSZ's Huffman coding stage. From these distributions, we create a dictionary of pre-computed codebooks such that during compression, a codebook is selected from the dictionary instead of computing a custom codebook. Second, we explore three codebook selection criteria to be applied at runtime. Finally, we evaluate our scheme on real-world datasets and in the context of two important application use cases, HDF5 and MPI, using an NVIDIA A100 GPU. Our evaluation shows that our method can reduce the Huffman coding penalty by a factor of 78--92×, translating to a total speedup of up to 5× over baseline cuSZ. Smaller HDF5 chunk sizes enjoy over an 8× speedup in compression and MPI messages on the scale of tens of MB have a 1.4--30.5× speedup in communication time.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128303633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seizing the Bandwidth Scaling of On-Package Interconnect in a Post-Moore's Law World
Grigory Chirkov, D. Wentzlaff
DOI: 10.1145/3577193.3593702
The slowing and forecasted end of Moore's Law have forced designers to look beyond simply adding transistors, encouraging them to employ other unused resources as a means to increase chip performance. At the same time, in recent years, inter-die interconnect technologies have made a huge leap forward, dramatically increasing the available bandwidth. While the end of Moore's Law will inevitably slow down the performance advances of single-die setups, interconnect technologies will likely continue to scale. We envision a future where designers must create ways to exploit interconnect utilization for better system performance. As an example of a feature that converts interconnect utilization into performance, we present Meduza - a write-update coherence protocol for future chiplet systems. Meduza extends previous write-update protocols to systems with multi-level cache hierarchies. Meduza improves execution speed in our benchmark suite by 19% compared to the MESIF coherence protocol on a chiplet-based system. Moreover, Meduza promises even greater advantages in future systems. This work shows that by exploiting excess interconnect bandwidth, there is significant potential for additional performance in modern and future chiplet systems.
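The difference between write-invalidate and write-update coherence can be shown with a toy single-level directory model: on a write, an invalidate protocol drops every other copy, while an update protocol spends interconnect bandwidth to refresh them so later remote reads still hit. This is only a conceptual sketch; Meduza itself extends write-update to multi-level chiplet cache hierarchies.

```c
/* Toy model of write-invalidate vs. write-update for one cache line. */
#include <stdint.h>

#define MAX_CACHES 8

typedef struct {
    uint64_t value;
    uint8_t  valid[MAX_CACHES];    /* which caches hold a copy       */
    uint64_t copy[MAX_CACHES];     /* their (kept-coherent) copies   */
} Line;

/* Write-invalidate: sharers lose their copies and must re-fetch later. */
void write_invalidate(Line *l, int writer, uint64_t v)
{
    for (int c = 0; c < MAX_CACHES; ++c)
        if (c != writer) l->valid[c] = 0;
    l->value = l->copy[writer] = v;
    l->valid[writer] = 1;
}

/* Write-update: every existing copy is refreshed with the new value,
 * trading interconnect traffic for fewer coherence misses on later reads. */
void write_update(Line *l, int writer, uint64_t v)
{
    l->value = v;
    l->valid[writer] = 1;
    for (int c = 0; c < MAX_CACHES; ++c)
        if (l->valid[c]) l->copy[c] = v;
}
```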
{"title":"Seizing the Bandwidth Scaling of On-Package Interconnect in a Post-Moore's Law World","authors":"Grigory Chirkov, D. Wentzlaff","doi":"10.1145/3577193.3593702","DOIUrl":"https://doi.org/10.1145/3577193.3593702","url":null,"abstract":"The slowing and forecasted end of Moore's Law have forced designers to look beyond simply adding transistors, encouraging them to employ other unused resources as a manner to increase chip performance. At the same time, in recent years, inter-die interconnect technologies made a huge leap forward, dramatically increasing the available bandwidth. While the end of Moore's Law will inevitably slow down the performance advances of single-die setups, interconnect technologies will likely continue to scale. We envision a future where designers must create ways to exploit interconnect utilization for better system performance. As an example of a feature that converts interconnect utilization into performance, we present Meduza - a write-update coherence protocol for future chiplet systems. Meduza extends previous write-update protocols to systems with multi-level cache hierarchies. Meduza improves execution speed in our benchmark suite by 19% when compared to the MESIF coherence protocol on a chiplet-based system. Moreover, Meduza promises even more advantages in future systems. This work shows that by exploiting excess interconnect bandwidth, there is significant potential for additional performance in modern and future chiplet systems.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130870155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DStore: A Lightweight Scalable Learning Model Repository with Fine-Grain Tensor-Level Access
Meghana Madhyastha, Robert Underwood, R. Burns, Bogdan Nicolae
DOI: 10.1145/3577193.3593730
The ability to share and reuse deep learning (DL) models is a key driver that facilitates the rapid adoption of artificial intelligence (AI) in both industrial and scientific applications. However, state-of-the-art approaches to storing and accessing DL models efficiently at scale lag behind. Most often, DL models are serialized using various formats (e.g., HDF5, SavedModel) and stored as files on POSIX file systems. While simple and portable, such an approach exhibits high serialization and I/O overheads, especially under concurrency. Additionally, the emergence of advanced AI techniques (transfer learning, sensitivity analysis, explainability, etc.) introduces the need for fine-grained access to tensors to facilitate the extraction and reuse of individual tensors or subsets of tensors. Such patterns are underserved by state-of-the-art approaches: requiring tensors to be read in bulk incurs suboptimal performance, scales poorly, and/or overutilizes network bandwidth. In this paper we propose a lightweight, distributed, RDMA-enabled learning model repository that addresses these challenges. Specifically, we introduce several ideas: compact architecture graph representation with stable hashing and client-side metadata caching, scalable load balancing across multiple providers, RDMA-optimized data staging, and direct access to raw tensor data. We evaluate our proposal in extensive experiments that involve different access patterns using learning models of diverse shapes and sizes. Our evaluations show a significant improvement (between 2× and 30×) over a variety of state-of-the-art model storage approaches while scaling to half the Cooley cluster at the Argonne Leadership Computing Facility.
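The fine-grained access pattern can be sketched as: hash the tensor's name with a stable hash, consult a client-side metadata cache for its (provider, offset, size), and read exactly that byte range rather than deserializing a whole model file. TensorMeta, lookup_meta(), and read_range() are hypothetical stand-ins (the real system uses RDMA and a distributed provider layer).

```c
/* Hedged sketch of hash-addressed, range-based tensor access. */
#include <stdint.h>

typedef struct { int provider; uint64_t offset; uint64_t size; } TensorMeta;

/* FNV-1a 64-bit: a simple, stable string hash. */
uint64_t stable_hash(const char *name)
{
    uint64_t h = 14695981039346656037ULL;
    for (const unsigned char *p = (const unsigned char *)name; *p; ++p) {
        h ^= *p;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hypothetical helpers: cached metadata lookup (0 on hit) and a direct
 * ranged read (an RDMA GET in the real system, a pread()-style call here). */
int lookup_meta(uint64_t key, TensorMeta *out);
int read_range(int provider, uint64_t off, uint64_t len, void *dst);

int fetch_tensor(const char *name, void *dst)
{
    TensorMeta m;
    if (lookup_meta(stable_hash(name), &m) != 0)
        return -1;                      /* miss: would fall back to a provider */
    return read_range(m.provider, m.offset, m.size, dst);
}
```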
{"title":"DStore: A Lightweight Scalable Learning Model Repository with Fine-Grain Tensor-Level Access","authors":"Meghana Madhyastha, Robert Underwood, R. Burns, Bogdan Nicolae","doi":"10.1145/3577193.3593730","DOIUrl":"https://doi.org/10.1145/3577193.3593730","url":null,"abstract":"The ability to share and reuse deep learning (DL) models is a key driver that facilitates the rapid adoption of artificial intelligence (AI) in both industrial and scientific applications. However, state-of-the-art approaches to store and access DL models efficiently at scale lag behind. Most often, DL models are serialized by using various formats (e.g., HDF5, SavedModel) and stored as files on POSIX file systems. While simple and portable, such an approach exhibits high serialization and I/O overheads, especially under concurrency. Additionally, the emergence of advanced AI techniques (transfer learning, sensitivity analysis, explainability, etc.) introduces the need for fine-grained access to tensors to facilitate the extraction and reuse of individual or subsets of tensors. Such patterns are underserved by state-of-the-art approaches. Requiring tensors to be read in bulk incurs suboptimal performance, scales poorly, and/or overutilizes network bandwidth. In this paper we propose a lightweight, distributed, RDMA-enabled learning model repository that addresses these challenges. Specifically we introduce several ideas: compact architecture graph representation with stable hashing and client-side metadata caching, scalable load balancing on multiple providers, RDMA-optimized data staging, and direct access to raw tensor data. We evaluate our proposal in extensive experiments that involve different access patterns using learning models of diverse shapes and sizes. Our evaluations show a significant improvement (between 2 and 30× over a variety of state-of-the-art model storage approaches while scaling to half the Cooley cluster at the Argonne Leadership Computing Facility.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127128608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training
Anqi Guo, Y. Hao, Chunshu Wu, Pouya Haghi, Zhenyu Pan, Min Si, Dingwen Tao, Ang Li, Martin C. Herbordt, Tong Geng
DOI: 10.1145/3577193.3593724
Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from the all-to-all communication bottleneck, which limits scalability. SmartNICs couple computation and communication capabilities to provide powerful network-facing heterogeneous devices that reduce communication overhead. There has not, however, been a distributed system design that fully leverages SmartNIC resources to address scalability of DLRMs. We propose a software-hardware co-design of a heterogeneous SmartNIC system that overcomes the communication bottleneck of distributed DLRMs, mitigates the pressure on memory bandwidth, and improves computation efficiency. We provide a set of SmartNIC designs of cache systems (including local cache and remote cache) and SmartNIC computation kernels that reduce data movement, relieve memory lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches and optimizes the overall system performance with higher data reuse. Our evaluation shows that the system achieves 2.1× latency speedup for inference and 1.6× throughput speedup for training.
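The caching idea can be illustrated with a toy direct-mapped cache of embedding rows consulted before any remote lookup, which is where the all-to-all traffic and memory-bandwidth pressure come from. remote_fetch_row(), the cache geometry, and the embedding dimension are illustrative assumptions, not the paper's SmartNIC design.

```c
/* Toy sketch of an embedding-row cache in front of remote lookups. */
#include <stdint.h>
#include <string.h>

#define DIM        64          /* embedding vector length          */
#define CACHE_SETS 4096        /* direct-mapped, one row per set   */

typedef struct {
    int64_t id[CACHE_SETS];    /* -1 marks an empty set            */
    float   row[CACHE_SETS][DIM];
} EmbCache;

/* Hypothetical: fetch one embedding row from a remote shard. */
void remote_fetch_row(int64_t id, float *out);

void lookup_embedding(EmbCache *c, int64_t id, float *out)
{
    uint64_t set = (uint64_t)id % CACHE_SETS;
    if (c->id[set] != id) {               /* miss: pay the network cost once */
        remote_fetch_row(id, c->row[set]);
        c->id[set] = id;
    }
    memcpy(out, c->row[set], sizeof(float) * DIM);  /* hit or freshly filled */
}
```

Grouping queries that hit the same rows within a batch, as the paper's graph algorithm does, raises the hit rate of exactly this kind of cache.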
{"title":"Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training","authors":"Anqi Guo, Y. Hao, Chunshu Wu, Pouya Haghi, Zhenyu Pan, Min Si, Dingwen Tao, Ang Li, Martin C. Herbordt, Tong Geng","doi":"10.1145/3577193.3593724","DOIUrl":"https://doi.org/10.1145/3577193.3593724","url":null,"abstract":"Deep Learning Recommendation Models (DLRMs) are important applications in various domains and have evolved into one of the largest and most important machine learning applications. With their trillions of parameters necessarily exceeding the high bandwidth memory (HBM) capacity of GPUs, ever more massive DLRMs require large-scale multi-node systems for distributed training and inference. However, these all suffer from the all-to-all communication bottleneck, which limits scalability. SmartNICs couple computation and communication capabilities to provide powerful network-facing heterogeneous devices that reduce communication overhead. There has not, however, been a distributed system design that fully leverages SmartNIC resources to address scalability of DLRMs. We propose a software-hardware co-design of a heterogeneous SmartNIC system that overcomes the communication bottleneck of distributed DLRMs, mitigates the pressure on memory bandwidth, and improves computation efficiency. We provide a set of SmartNIC designs of cache systems (including local cache and remote cache) and SmartNIC computation kernels that reduce data movement, relieve memory lookup intensity, and improve the GPU's computation efficiency. In addition, we propose a graph algorithm that improves the data locality of queries within batches and optimizes the overall system performance with higher data reuse. Our evaluation shows that the system achieves 2.1× latency speedup for inference and 1.6× throughput speedup for training.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130860890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}