
Latest publications: ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

ALTO
A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi
The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.x relative to the best mode-specific (compressed sparse fiber) formats.
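The core linearization idea can be sketched as follows. This is a simplified fixed bit interleaving for illustration only; ALTO's actual encoding adaptively allocates bits to each mode based on its extent:

```python
# Simplified, mode-agnostic linearization (fixed bit interleaving; ALTO's
# actual adaptive encoding trades bits per mode for lower memory usage).
def linearize(indices, bits_per_mode):
    """Interleave the bits of each mode index into one integer key."""
    key, out_bit = 0, 0
    for b in range(max(bits_per_mode)):
        for mode, idx in enumerate(indices):
            if b < bits_per_mode[mode]:
                key |= ((idx >> b) & 1) << out_bit
                out_bit += 1
    return key

# Sorting COO nonzeros by this key keeps spatial neighbors close in memory.
nnz = [((2, 3, 1), 0.5), ((7, 0, 4), -2.0), ((2, 3, 2), 1.25)]
nnz.sort(key=lambda e: linearize(e[0], [3, 2, 3]))
print([idx for idx, _ in nnz])  # the two (2, 3, *) nonzeros end up adjacent
```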
{"title":"ALTO","authors":"A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi","doi":"10.1145/3447818.3461703","DOIUrl":"https://doi.org/10.1145/3447818.3461703","url":null,"abstract":"The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.x relative to the best mode-specific (compressed sparse fiber) formats.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":" 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91412390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
μSteal
Amirhossein Mirhosseini, T. Wenisch
Modern internet services are moving towards distributed microservice architectures, wherein a complex application is decomposed into numerous discrete microservices to improve programmability, reliability, manageability, and scalability. A key property of microservice-based architectures is that common microservices may be shared by multiple end-to-end cloud services. As an example, a speech-recognition microservice might serve as an early node in the microservice graphs of several end-to-end services. However, given the dissimilarities across microservice graphs and varying end-to-end latency constraints across services, shared microservices may need to operate under differing latency constraints for each service. As a result, in existing systems, most providers either deploy multiple instance pools for each latency constraint, or require all requests to needlessly meet the most stringent constraint. In this paper, we argue that sharing microservice instances across multiple services can significantly reduce the number of instances, especially under highly asymmetric latency constraints. We propose a request scheduling mechanism, called μSteal, which leverages preemptive work and resource stealing to schedule arriving requests to cores within a "mixed-criticality" microservice instance. μSteal provisions "core reservations" for each request class based on its latency requirements, but allows a class to steal cores from other classes if they would otherwise remain idle. When a class requires its full reservation, μSteal preempts the stolen cores, returning them to their reserved class. μSteal employs a runtime feedback controller, augmented by a queuing-theory-based analytical model, to tune core reservations across classes, seeking to maximize request throughput within each instance while meeting all classes' latency constraints. We show that μSteal reduces the instances required for several shared microservice deployments by 1.29x compared to deploying multiple, segregated instance pools.
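The reservation-plus-stealing policy can be illustrated with a toy allocator. The names and accounting below are assumptions for illustration, not the paper's implementation, which adds real preemption mechanics and a queuing-theory-driven controller that retunes reservations at runtime:

```python
# Toy reservation-plus-stealing core allocator (illustrative only).
class CorePool:
    def __init__(self, reservations):
        self.cap = dict(reservations)          # reserved cores per class
        self.own = {c: 0 for c in self.cap}    # cores a class holds from its own share
        self.lent = {c: 0 for c in self.cap}   # cores lent out of a class's share

    def acquire(self, cls):
        if self.own[cls] + self.lent[cls] < self.cap[cls]:
            self.own[cls] += 1                 # idle core in own reservation
            return True
        if self.lent[cls] > 0:                 # reservation fully lent out:
            self.lent[cls] -= 1                # preempt a thief and reclaim the core
            self.own[cls] += 1
            return True
        for other in self.cap:                 # try to steal an idle core elsewhere
            if other != cls and self.own[other] + self.lent[other] < self.cap[other]:
                self.lent[other] += 1
                return True
        return False                           # nothing idle anywhere: request queues

pool = CorePool({"latency_critical": 3, "batch": 1})
pool.acquire("batch"); pool.acquire("batch")   # second call steals an idle core
print(pool.acquire("latency_critical"))        # True: own share, preempting if needed
```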
{"title":"μSteal","authors":"Amirhossein Mirhosseini, T. Wenisch","doi":"10.1145/3447818.3463529","DOIUrl":"https://doi.org/10.1145/3447818.3463529","url":null,"abstract":"Modern internet services are moving towards distributed microservice architectures, wherein a complex application is decomposed into numerous discrete microservices to improve programmability, reliability, manageability, and scalability. A key property of microservice-based architectures is that common microservices may be shared by multiple end-to-end cloud services. As an example, a speech-recognition microservice might serve as an early node in the microservice graphs of several end-to-end services. However, given the dissimilarities across microservice graphs and varying end-to-end latency constraints across services, shared microservices may need to operate under differing latency constraints for each service. As a result, in existing systems, most providers either deploy multiple instance pools for each latency constraint, or require all requests to needlessly meet the most stringent constraint. In this paper, we argue that sharing microservice instances across multiple services can reduce significantly the number of instances, especially under highly asymmetric latency constraints. We propose a request scheduling mechanism, called μSteal, which leverages preemptive work and resource stealing to schedule the arriving requests to cores within a ``mixed-criticality'' microservice instance. μSteal provisions ``core reservations'' for each request class based on their latency requirements, but allows a class to steal cores from other classes if they would otherwise remain idle. But, when a class requires its full reservation, μSteal preempts stolen cores, returning them to their reserved class. μSteal employs a runtime feedback controller augmented by a queuing theory-based analytical model to tune core reservations across classes, seeking to maximize the request throughput within each instance while meeting all classes' latency constraints. We show that μSteal reduces required instances for several shared microservice deployments by 1.29x as compared to deploying multiple, segregated instance pools.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"136 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77940233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference
Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher
Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on deep learning. A recent promising direction for speeding up inference is to exploit weight repetition. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factorize and memoize common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs with weight-dependent structure. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given the DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.
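The weight-repetition idea that SumMerge builds on (shared with prior work such as AGR) can be illustrated with a minimal sketch: sum the inputs that share a quantized weight value first, then perform only one multiply per unique weight:

```python
# Weight-repetition-aware dot product: with q unique quantized weight
# values, do one add per input but only q multiplies, instead of n of each.
# (A simplified illustration of the general idea, not SumMerge's
# data-flow-graph algorithm.)
from collections import defaultdict

def repetition_aware_dot(weights, inputs):
    partial = defaultdict(float)
    for w, x in zip(weights, inputs):
        partial[w] += x                              # group inputs by weight value
    return sum(w * s for w, s in partial.items())    # one multiply per unique weight

weights = [0.5, -1.0, 0.5, 0.5, -1.0]   # quantized: only 2 unique values
inputs  = [1.0,  2.0, 3.0, 4.0,  5.0]
assert repetition_aware_dot(weights, inputs) == sum(w * x for w, x in zip(weights, inputs))
```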
{"title":"SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference","authors":"Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher","doi":"10.1145/3447818.3460375","DOIUrl":"https://doi.org/10.1145/3447818.3460375","url":null,"abstract":"Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed-up inference is to exploit emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition aware inference kernel to factorize and memoize out common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91064216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Does it matter?: OMPSanitizer: an impact analyzer of reported data races in OpenMP programs
Wenwen Wang, Pei-Hung Lin
Data races are a primary source of concurrency bugs in parallel programs. Yet, debugging data races is not easy, even with the large number of data race detection tools available. In particular, a manually intensive and time-consuming investigation process remains after data races are reported by existing race detection tools. To address this issue, we present OMPSanitizer in this paper. OMPSanitizer employs a novel, semantic-aware impact analysis mechanism to assess the potential impact of detected data races, so that developers can focus on the data races most likely to produce a harmful impact. This way, OMPSanitizer removes the heavy debugging burden of data races from developers and simultaneously enhances debugging efficiency. We have implemented OMPSanitizer on top of the widely used dynamic binary instrumentation infrastructure Intel Pin. Our evaluation results on a broad range of OpenMP programs from the DataRaceBench benchmark suite and an ECP Proxy application demonstrate that OMPSanitizer can precisely report the impact of data races detected by existing race detectors, e.g., Helgrind and ThreadSanitizer. We believe OMPSanitizer will provide a new perspective on automating debugging support for data races in OpenMP programs.
{"title":"Does it matter?: OMPSanitizer: an impact analyzer of reported data races in OpenMP programs","authors":"Wenwen Wang, Pei-Hung Lin","doi":"10.1145/3447818.3460379","DOIUrl":"https://doi.org/10.1145/3447818.3460379","url":null,"abstract":"Data races are a primary source of concurrency bugs in parallel programs. Yet, debugging data races is not easy, even with a large amount of data race detection tools. In particular, there still exists a manually-intensive and time-consuming investigation process after data races are reported by existing race detection tools. To address this issue, we present OMPSanitizer in this paper. OMPSanitizer employs a novel and semantic-aware impact analysis mechanism to assess the potential impact of detected data races so that developers can focus on data races with a high probability to produce a harmful impact. This way, OMPSanitizer can remove the heavy debugging burden of data races from developers and simultaneously enhance the debugging efficiency. We have implemented OMPSanitizer based on the widely-used dynamic binary instrumentation infrastructure, Intel Pin. Our evaluation results on a broad range of OpenMP programs from the DataRaceBench benchmark suite and an ECP Proxy application demonstrate that OMPSanitizer can precisely report the impact of data races detected by existing race detectors, e.g., Helgrind and ThreadSanitizer. We believe OMPSanitizer will provide a new perspective on automating the debugging support for data races in OpenMP programs.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"150 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74736179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A practical tile size selection model for affine loop nests
Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula
Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic tiling of loop nests rely on auto-tuning to find good tile sizes. Tile size selection models proposed in the literature either fall back to modeling complex non-linear optimization problems or tackle a narrow class of inputs. Hence, a fast and generic tile size selection model is desirable for it to be adopted into compiler infrastructures like those of GCC, LLVM, or MLIR. In this paper, we propose a new, fast and lightweight tile size selection model that considers temporal and spatial reuse along dimensions of a loop nest. For an n-dimensional loop nest, we determine the tile sizes by calculating the zeros of a polynomial in a single variable of degree at most n. Our tile size calculation model also accounts for vectorizability of the innermost dimension. We demonstrate the generality of our approach by selecting benchmarks from various domains: linear algebra kernels, digital signal processing (DSP) and image processing. We implement our tile size selection model in PolyMage (a domain-specific language and compiler for image processing pipelines) and Pluto (state-of-the-art polyhedral auto-parallelizer). Implementing the model in PolyMage allows us to extend it to DSP and linear algebra domains and also incorporate idiom recognition phases so that optimized vendor-specific library implementations could be utilized whenever profitable. Our experiments demonstrate a significant geomean performance gain of 2.2x over Matlab on benchmarks from the DSP domain. For PolyBench, we obtain a geomean speedup of 1.04x (maximum speedup of 1.3x) over Pluto.
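A worked instance of the polynomial-zero formulation, under a deliberately simplified cache-capacity model (assuming tiled matrix multiplication with square tiles of doubles; the paper's model additionally accounts for reuse directions and vectorizability of the innermost dimension):

```python
# For tiled matmul with square t x t tiles of doubles, one tile each of
# A, B, and C should fit in cache, so the tile size is the positive zero
# of p(t) = arrays * elem_bytes * t^2 - cache_bytes.
import math

def tile_size(cache_bytes, arrays=3, elem_bytes=8):
    t = math.sqrt(cache_bytes / (arrays * elem_bytes))
    # Round down to a power of two so the innermost tile stays a
    # multiple of the vector width (assumes t >= 1).
    return 1 << (int(t).bit_length() - 1)

print(tile_size(32 * 1024))   # 32 KiB L1: t ~= 36.9, rounded to 32
```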
{"title":"A practical tile size selection model for affine loop nests","authors":"Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula","doi":"10.1145/3447818.3462213","DOIUrl":"https://doi.org/10.1145/3447818.3462213","url":null,"abstract":"Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic tiling of loop nests rely on auto-tuning to find good tile sizes. Tile size selection models proposed in the literature either fall back to modeling complex non-linear optimization problems or tackle a narrow class of inputs. Hence, a fast and generic tile size selection model is desirable for it to be adopted into compiler infrastructures like those of GCC, LLVM, or MLIR. In this paper, we propose a new, fast and lightweight tile size selection model that considers temporal and spatial reuse along dimensions of a loop nest. For an n-dimensional loop nest, we determine the tile sizes by calculating the zeros of a polynomial in a single variable of degree at most n. Our tile size calculation model also accounts for vectorizability of the innermost dimension. We demonstrate the generality of our approach by selecting benchmarks from various domains: linear algebra kernels, digital signal processing (DSP) and image processing. We implement our tile size selection model in PolyMage (a domain-specific language and compiler for image processing pipelines) and Pluto (state-of-the-art polyhedral auto-parallelizer). Implementing the model in PolyMage allows us to extend it to DSP and linear algebra domains and also incorporate idiom recognition phases so that optimized vendor-specific library implementations could be utilized whenever profitable. Our experiments demonstrate a significant geomean performance gain of 2.2x over Matlab on benchmarks from the DSP domain. For PolyBench, we obtain a geomean speedup of 1.04x (maximum speedup of 1.3x) over Pluto.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86228203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A performance portability framework for Python
Nader Al Awar, Steven Zhu, G. Biros, Miloš Gligorić
Kokkos is a programming model for writing performance portable applications for all major high performance computing platforms. It provides abstractions for data management and common parallel operations, allowing developers to write portable high performance code with minimal knowledge of architecture-specific details. Kokkos is implemented as a heavily-templated C++ library. However, C++ is not ideal for rapid prototyping and quick algorithmic exploration. An increasing number of developers use Python for scientific computing, machine learning, and data analytics. In this paper, we present a new Python framework, dubbed PyKokkos, for writing performance portable applications entirely in Python. PyKokkos provides Kokkos-like abstractions that are easier to use and more concise than the C++ interface. We implemented PyKokkos by building a translator from a subset of Python to C++ Kokkos and bridging necessary function calls via automatically generated Python bindings. PyKokkos is also compatible with NumPy, a widely-used high performance Python library. By porting several existing Kokkos applications to PyKokkos, including ExaMiniMD (∼3k lines of code in C++), we show that the latter can achieve efficient execution with low performance overhead.
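A minimal PyKokkos-style kernel in the spirit of the paper's examples; the exact decorator, view, and type names below are assumptions rather than verified API:

```python
# Sketch of a PyKokkos-style AXPY kernel (assumed API names: pk.workunit,
# pk.View, pk.View1D, pk.double, pk.parallel_for).
import pykokkos as pk

@pk.workunit
def axpy(i: int, y: pk.View1D[pk.double], x: pk.View1D[pk.double], a: pk.double):
    y[i] = a * x[i] + y[i]   # executed once per index i, in parallel

def main():
    n = 1_000_000
    x = pk.View([n], pk.double)   # Kokkos-like portable array abstraction
    y = pk.View([n], pk.double)
    pk.parallel_for(n, axpy, y=y, x=x, a=2.0)

main()
```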
{"title":"A performance portability framework for Python","authors":"Nader Al Awar, Steven Zhu, G. Biros, Miloš Gligorić","doi":"10.1145/3447818.3460376","DOIUrl":"https://doi.org/10.1145/3447818.3460376","url":null,"abstract":"Kokkos is a programming model for writing performance portable applications for all major high performance computing platforms. It provides abstractions for data management and common parallel operations, allowing developers to write portable high performance code with minimal knowledge of architecture-specific details. Kokkos is implemented as a heavily-templated C++ library. However, C++ is not ideal for rapid prototyping and quick algorithmic exploration. An increasing number of developers use Python for scientific computing, machine learning, and data analytics. In this paper, we present a new Python framework, dubbed PyKokkos, for writing performance portable applications entirely in Python. PyKokkos provides Kokkos-like abstractions that are easier to use and more concise than the C++ interface. We implemented PyKokkos by building a translator from a subset of Python to C++ Kokkos and bridging necessary function calls via automatically generated Python bindings. PyKokkos is also compatible with NumPy, a widely-used high performance Python library. By porting several existing Kokkos applications to PyKokkos, including ExaMiniMD (∼3k lines of code in C++), we show that the latter can achieve efficient execution with low performance overhead.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83282241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Topology-aware optimizations for multi-GPU ptychographic image reconstruction
Xiaodong Yu, Tekin Bicer, R. Kettimuthu, Ian T Foster
Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal space experimental data to high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions use common communication topologies, such as P2P graph and ring, that are provided by MPI and NCCL libraries, to establish inter-GPU communications. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that are composed of both high- (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between application-level communication topology and physical interconnection can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations to address the aforementioned mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare its performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform the counterparts and achieve up to 5.13× and 1.63× communication and end-to-end application speedups, respectively.
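One way to make an application-level communication topology link-aware is sketched below as a maximum-bandwidth spanning tree over measured per-pair link speeds; this toy greedy construction is illustrative, not the paper's exact topology design:

```python
# Build a broadcast tree that prefers fast NVLink edges over slow PCIe
# edges: a greedy (Prim-style) maximum-bandwidth spanning tree.
import heapq

def broadcast_tree(links, root):
    """links: dict mapping frozenset({a, b}) -> link bandwidth in GB/s."""
    gpus = set().union(*links)
    visited, edges = {root}, []
    heap = [(-bw, root, b) for pair, bw in links.items() if root in pair
            for b in pair if b != root]
    heapq.heapify(heap)
    while heap and len(visited) < len(gpus):
        neg_bw, a, b = heapq.heappop(heap)
        if b in visited:
            continue
        visited.add(b)
        edges.append((a, b, -neg_bw))
        for pair, bw in links.items():        # expand frontier from b
            if b in pair:
                c = next(x for x in pair if x != b)
                if c not in visited:
                    heapq.heappush(heap, (-bw, b, c))
    return edges

# 4 GPUs: 0-1 and 2-3 have NVLink (50 GB/s); all other pairs use PCIe (12 GB/s).
links = {frozenset({0, 1}): 50, frozenset({2, 3}): 50,
         frozenset({0, 2}): 12, frozenset({0, 3}): 12,
         frozenset({1, 2}): 12, frozenset({1, 3}): 12}
print(broadcast_tree(links, root=0))   # both NVLinks plus a single PCIe hop
```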
{"title":"Topology-aware optimizations for multi-GPU ptychographic image reconstruction","authors":"Xiaodong Yu, Tekin Bicer, R. Kettimuthu, Ian T Foster","doi":"10.1145/3447818.3460380","DOIUrl":"https://doi.org/10.1145/3447818.3460380","url":null,"abstract":"Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal space experimental data to high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions use common communication topologies, such as P2P graph and ring, that are provided by MPI and NCCL libraries, to establish inter-GPU communications. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that are composed of both high- (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between application-level communication topology and physical interconnection can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations to address the aforementioned mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare its performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform the counterparts and achieve up to 5.13× and 1.63× communication and end-to-end application speedups, respectively.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"514 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80097226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Athena: high-performance sparse tensor contraction sequence on heterogeneous memory
Jiawen Liu, Dong Li, R. Gioiosa, Jiajia Li
Sparse tensor contraction (SpTC) sequences are widely employed in many fields, such as chemistry and physics. However, implementing these sequences efficiently faces multiple challenges, including redundant computation and memory operations, massive memory consumption, and inefficient utilization of hardware. To address these challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages the emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces a shared hash-table-based sparse accumulator to eliminate unnecessary input processing and data migration; it uses a novel data-semantics-guided dynamic migration solution to make the best use of Optane-based HM for high performance; and it co-runs execution phases with different characteristics to enable high hardware utilization. Evaluated on 12 datasets, Athena delivers a 327-7362× speedup over the state-of-the-art SpTC algorithm. With dynamic data placement guided by data semantics, Athena improves performance on Optane-based HM over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 1.58×, 1.82×, and 2.34×, respectively. Athena also demonstrates its effectiveness in quantum chemistry and physics scenarios.
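The hash-table sparse accumulator idea can be sketched for a single two-operand contraction; this toy version is illustrative only, whereas Athena shares one accumulator across threads and adds HM-aware data placement:

```python
# Contract COO tensors A[i, k] and B[k, j] over k into C[i, j], using a
# hash table keyed by output index as the sparse accumulator.
from collections import defaultdict

def sptc(A, B):
    b_by_k = defaultdict(list)
    for (k, j), v in B.items():
        b_by_k[k].append((j, v))            # index B's nonzeros by contraction mode
    acc = defaultdict(float)                # sparse accumulator: output index -> value
    for (i, k), va in A.items():
        for j, vb in b_by_k.get(k, ()):
            acc[(i, j)] += va * vb
    return dict(acc)

A = {(0, 1): 2.0, (1, 0): 3.0}
B = {(1, 2): 4.0, (0, 2): 5.0}
print(sptc(A, B))   # {(0, 2): 8.0, (1, 2): 15.0}
```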
{"title":"Athena: high-performance sparse tensor contraction sequence on heterogeneous memory","authors":"Jiawen Liu, Dong Li, R. Gioiosa, Jiajia Li","doi":"10.1145/3447818.3460355","DOIUrl":"https://doi.org/10.1145/3447818.3460355","url":null,"abstract":"Sparse tensor contraction sequence has been widely employed in many fields, such as chemistry and physics. However, how to efficiently implement the sequence faces multiple challenges, such as redundant computations and memory operations, massive memory consumption, and inefficient utilization of hardware. To address the above challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces shared hash table-represented sparse accumulator to eliminate unnecessary input processing and data migration; Athena uses a novel data-semantic guided dynamic migration solution to make the best use of the Optane-based HM for high performance; Athena also co-runs execution phases with different characteristics to enable high hardware utilization. Evaluating with 12 datasets, we show that Athena brings 327-7362× speedup over the state-of-the-art SpTC algorithm. With the dynamic data placement guided by data semantics, Athena brings performance improvement on Optane-based HM over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 1.58×, 1.82×, and 2.34× respectively. Athena also showcases its effectiveness in quantum chemistry and physics scenarios.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86711076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
PLANAR 平面
Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó
{"title":"PLANAR","authors":"Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó","doi":"10.1007/978-3-642-41714-6_162253","DOIUrl":"https://doi.org/10.1007/978-3-642-41714-6_162253","url":null,"abstract":"","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83272279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 59
Omegaflow
Yaoyang Zhou, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao
This paper investigates how to better track and deliver dependencies in dependency-based cores to exploit instruction-level parallelism (ILP) as much as possible. To this end, we first propose an analytical performance model for the state-of-the-art dependency-based core, Forwardflow, and identify two vital factors that bound its performance. We then propose Omegaflow, a dependency-based architecture adopting three new techniques that address the identified factors. Experimental results show that Omegaflow improves IPC by 24.6% compared to the state-of-the-art design, approaching the performance of the OoO architecture with an ideal scheduler (94.4%) without increasing the clock cycle, while consuming only 8.82% more energy than Forwardflow.
{"title":"Omegaflow","authors":"Yaoyang Zhou, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao","doi":"10.1145/3447818.3460367","DOIUrl":"https://doi.org/10.1145/3447818.3460367","url":null,"abstract":"This paper investigates how to better track and deliver dependency in dependency-based cores to exploit instruction-level parallelism (ILP) as much as possible. To this end, we first propose an analytical performance model for the state-of-art dependency-based core, Forwardflow, and figure out two vital factors affecting its upper bound of performance. Then we propose Omegaflow,a dependency-based architecture adopting three new techniques, which respond to the discovered factors. Experimental results show that Omegaflow improves IPC by 24.6% compared to the state-of-the-art design, approaching the performance of the OoO architecture with an ideal scheduler (94.4%) without increasing the clock cycle and consumes only 8.82% more energy than Forwardflow.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"117 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81779140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0