
Latest publications: ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

ALTO
A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi
The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.x relative to the best mode-specific (compressed sparse fiber) formats.
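The core linearization idea can be sketched as follows. This is a simplified fixed bit interleaving for illustration only; ALTO's actual encoding adaptively allocates bits to each mode based on its extent:

```python
# Simplified, mode-agnostic linearization (fixed bit interleaving; ALTO's
# actual adaptive encoding trades bits per mode for lower memory usage).
def linearize(indices, bits_per_mode):
    """Interleave the bits of each mode index into one integer key."""
    key, out_bit = 0, 0
    for b in range(max(bits_per_mode)):
        for mode, idx in enumerate(indices):
            if b < bits_per_mode[mode]:
                key |= ((idx >> b) & 1) << out_bit
                out_bit += 1
    return key

# Sorting COO nonzeros by this key keeps spatial neighbors close in memory.
nnz = [((2, 3, 1), 0.5), ((7, 0, 4), -2.0), ((2, 3, 2), 1.25)]
nnz.sort(key=lambda e: linearize(e[0], [3, 2, 3]))
print([idx for idx, _ in nnz])  # the two (2, 3, *) nonzeros end up adjacent
```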
{"title":"ALTO","authors":"A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi","doi":"10.1145/3447818.3461703","DOIUrl":"https://doi.org/10.1145/3447818.3461703","url":null,"abstract":"The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.x relative to the best mode-specific (compressed sparse fiber) formats.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":" 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91412390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
μSteal
Amirhossein Mirhosseini, T. Wenisch
Modern internet services are moving towards distributed microservice architectures, wherein a complex application is decomposed into numerous discrete microservices to improve programmability, reliability, manageability, and scalability. A key property of microservice-based architectures is that common microservices may be shared by multiple end-to-end cloud services. As an example, a speech-recognition microservice might serve as an early node in the microservice graphs of several end-to-end services. However, given the dissimilarities across microservice graphs and varying end-to-end latency constraints across services, shared microservices may need to operate under differing latency constraints for each service. As a result, in existing systems, most providers either deploy multiple instance pools for each latency constraint, or require all requests to needlessly meet the most stringent constraint. In this paper, we argue that sharing microservice instances across multiple services can significantly reduce the number of instances, especially under highly asymmetric latency constraints. We propose a request scheduling mechanism, called μSteal, which leverages preemptive work and resource stealing to schedule arriving requests to cores within a "mixed-criticality" microservice instance. μSteal provisions "core reservations" for each request class based on its latency requirements, but allows a class to steal cores from other classes if they would otherwise remain idle. When a class requires its full reservation, μSteal preempts the stolen cores, returning them to their reserved class. μSteal employs a runtime feedback controller, augmented by a queuing-theory-based analytical model, to tune core reservations across classes, seeking to maximize request throughput within each instance while meeting all classes' latency constraints. We show that μSteal reduces the instances required for several shared microservice deployments by 1.29x compared to deploying multiple, segregated instance pools.
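The reservation-plus-stealing policy can be illustrated with a toy allocator. The names and accounting below are assumptions for illustration, not the paper's implementation, which adds real preemption mechanics and a queuing-theory-driven controller that retunes reservations at runtime:

```python
# Toy reservation-plus-stealing core allocator (illustrative only).
class CorePool:
    def __init__(self, reservations):
        self.cap = dict(reservations)          # reserved cores per class
        self.own = {c: 0 for c in self.cap}    # cores a class holds from its own share
        self.lent = {c: 0 for c in self.cap}   # cores lent out of a class's share

    def acquire(self, cls):
        if self.own[cls] + self.lent[cls] < self.cap[cls]:
            self.own[cls] += 1                 # idle core in own reservation
            return True
        if self.lent[cls] > 0:                 # reservation fully lent out:
            self.lent[cls] -= 1                # preempt a thief and reclaim the core
            self.own[cls] += 1
            return True
        for other in self.cap:                 # try to steal an idle core elsewhere
            if other != cls and self.own[other] + self.lent[other] < self.cap[other]:
                self.lent[other] += 1
                return True
        return False                           # nothing idle anywhere: request queues

pool = CorePool({"latency_critical": 3, "batch": 1})
pool.acquire("batch"); pool.acquire("batch")   # second call steals an idle core
print(pool.acquire("latency_critical"))        # True: own share, preempting if needed
```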
{"title":"μSteal","authors":"Amirhossein Mirhosseini, T. Wenisch","doi":"10.1145/3447818.3463529","DOIUrl":"https://doi.org/10.1145/3447818.3463529","url":null,"abstract":"Modern internet services are moving towards distributed microservice architectures, wherein a complex application is decomposed into numerous discrete microservices to improve programmability, reliability, manageability, and scalability. A key property of microservice-based architectures is that common microservices may be shared by multiple end-to-end cloud services. As an example, a speech-recognition microservice might serve as an early node in the microservice graphs of several end-to-end services. However, given the dissimilarities across microservice graphs and varying end-to-end latency constraints across services, shared microservices may need to operate under differing latency constraints for each service. As a result, in existing systems, most providers either deploy multiple instance pools for each latency constraint, or require all requests to needlessly meet the most stringent constraint. In this paper, we argue that sharing microservice instances across multiple services can reduce significantly the number of instances, especially under highly asymmetric latency constraints. We propose a request scheduling mechanism, called μSteal, which leverages preemptive work and resource stealing to schedule the arriving requests to cores within a ``mixed-criticality'' microservice instance. μSteal provisions ``core reservations'' for each request class based on their latency requirements, but allows a class to steal cores from other classes if they would otherwise remain idle. But, when a class requires its full reservation, μSteal preempts stolen cores, returning them to their reserved class. μSteal employs a runtime feedback controller augmented by a queuing theory-based analytical model to tune core reservations across classes, seeking to maximize the request throughput within each instance while meeting all classes' latency constraints. We show that μSteal reduces required instances for several shared microservice deployments by 1.29x as compared to deploying multiple, segregated instance pools.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"136 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77940233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference
Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher
Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on deep learning. A recent promising direction for speeding up inference is to exploit weight repetition. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factorize and memoize common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs with weight-dependent structure. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given the DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.
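The weight-repetition idea that SumMerge builds on (shared with prior work such as AGR) can be illustrated with a minimal sketch: sum the inputs that share a quantized weight value first, then perform only one multiply per unique weight:

```python
# Weight-repetition-aware dot product: with q unique quantized weight
# values, do one add per input but only q multiplies, instead of n of each.
# (A simplified illustration of the general idea, not SumMerge's
# data-flow-graph algorithm.)
from collections import defaultdict

def repetition_aware_dot(weights, inputs):
    partial = defaultdict(float)
    for w, x in zip(weights, inputs):
        partial[w] += x                              # group inputs by weight value
    return sum(w * s for w, s in partial.items())    # one multiply per unique weight

weights = [0.5, -1.0, 0.5, 0.5, -1.0]   # quantized: only 2 unique values
inputs  = [1.0,  2.0, 3.0, 4.0,  5.0]
assert repetition_aware_dot(weights, inputs) == sum(w * x for w, x in zip(weights, inputs))
```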
{"title":"SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference","authors":"Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher","doi":"10.1145/3447818.3460375","DOIUrl":"https://doi.org/10.1145/3447818.3460375","url":null,"abstract":"Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed-up inference is to exploit emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition aware inference kernel to factorize and memoize out common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves a speedup of between 1.09x-2.05x and 1.04x-1.51x relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7x to 15.4x.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91064216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Does it matter?: OMPSanitizer: an impact analyzer of reported data races in OpenMP programs
Wenwen Wang, Pei-Hung Lin
Data races are a primary source of concurrency bugs in parallel programs. Yet, debugging data races is not easy, even with the large number of data race detection tools available. In particular, a manually intensive and time-consuming investigation process remains after data races are reported by existing race detection tools. To address this issue, we present OMPSanitizer in this paper. OMPSanitizer employs a novel, semantic-aware impact analysis mechanism to assess the potential impact of detected data races, so that developers can focus on the data races most likely to produce a harmful impact. This way, OMPSanitizer removes the heavy debugging burden of data races from developers and simultaneously enhances debugging efficiency. We have implemented OMPSanitizer on top of the widely used dynamic binary instrumentation infrastructure Intel Pin. Our evaluation results on a broad range of OpenMP programs from the DataRaceBench benchmark suite and an ECP Proxy application demonstrate that OMPSanitizer can precisely report the impact of data races detected by existing race detectors, e.g., Helgrind and ThreadSanitizer. We believe OMPSanitizer will provide a new perspective on automating debugging support for data races in OpenMP programs.
{"title":"Does it matter?: OMPSanitizer: an impact analyzer of reported data races in OpenMP programs","authors":"Wenwen Wang, Pei-Hung Lin","doi":"10.1145/3447818.3460379","DOIUrl":"https://doi.org/10.1145/3447818.3460379","url":null,"abstract":"Data races are a primary source of concurrency bugs in parallel programs. Yet, debugging data races is not easy, even with a large amount of data race detection tools. In particular, there still exists a manually-intensive and time-consuming investigation process after data races are reported by existing race detection tools. To address this issue, we present OMPSanitizer in this paper. OMPSanitizer employs a novel and semantic-aware impact analysis mechanism to assess the potential impact of detected data races so that developers can focus on data races with a high probability to produce a harmful impact. This way, OMPSanitizer can remove the heavy debugging burden of data races from developers and simultaneously enhance the debugging efficiency. We have implemented OMPSanitizer based on the widely-used dynamic binary instrumentation infrastructure, Intel Pin. Our evaluation results on a broad range of OpenMP programs from the DataRaceBench benchmark suite and an ECP Proxy application demonstrate that OMPSanitizer can precisely report the impact of data races detected by existing race detectors, e.g., Helgrind and ThreadSanitizer. We believe OMPSanitizer will provide a new perspective on automating the debugging support for data races in OpenMP programs.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"150 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74736179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A practical tile size selection model for affine loop nests
Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula
Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic tiling of loop nests rely on auto-tuning to find good tile sizes. Tile size selection models proposed in the literature either fall back to modeling complex non-linear optimization problems or tackle a narrow class of inputs. Hence, a fast and generic tile size selection model is desirable for it to be adopted into compiler infrastructures like those of GCC, LLVM, or MLIR. In this paper, we propose a new, fast and lightweight tile size selection model that considers temporal and spatial reuse along dimensions of a loop nest. For an n-dimensional loop nest, we determine the tile sizes by calculating the zeros of a polynomial in a single variable of degree at most n. Our tile size calculation model also accounts for vectorizability of the innermost dimension. We demonstrate the generality of our approach by selecting benchmarks from various domains: linear algebra kernels, digital signal processing (DSP) and image processing. We implement our tile size selection model in PolyMage (a domain-specific language and compiler for image processing pipelines) and Pluto (state-of-the-art polyhedral auto-parallelizer). Implementing the model in PolyMage allows us to extend it to DSP and linear algebra domains and also incorporate idiom recognition phases so that optimized vendor-specific library implementations could be utilized whenever profitable. Our experiments demonstrate a significant geomean performance gain of 2.2x over Matlab on benchmarks from the DSP domain. For PolyBench, we obtain a geomean speedup of 1.04x (maximum speedup of 1.3x) over Pluto.
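A worked instance of the polynomial-zero formulation, under a deliberately simplified cache-capacity model (assuming tiled matrix multiplication with square tiles of doubles; the paper's model additionally accounts for reuse directions and vectorizability of the innermost dimension):

```python
# For tiled matmul with square t x t tiles of doubles, one tile each of
# A, B, and C should fit in cache, so the tile size is the positive zero
# of p(t) = arrays * elem_bytes * t^2 - cache_bytes.
import math

def tile_size(cache_bytes, arrays=3, elem_bytes=8):
    t = math.sqrt(cache_bytes / (arrays * elem_bytes))
    # Round down to a power of two so the innermost tile stays a
    # multiple of the vector width (assumes t >= 1).
    return 1 << (int(t).bit_length() - 1)

print(tile_size(32 * 1024))   # 32 KiB L1: t ~= 36.9, rounded to 32
```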
{"title":"A practical tile size selection model for affine loop nests","authors":"Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, Uday Bondhugula","doi":"10.1145/3447818.3462213","DOIUrl":"https://doi.org/10.1145/3447818.3462213","url":null,"abstract":"Loop tiling for locality is an important transformation for general-purpose and domain-specific compilation as it allows programs to exploit the benefits of deep memory hierarchies. Most code generation tools with the infrastructure to perform automatic tiling of loop nests rely on auto-tuning to find good tile sizes. Tile size selection models proposed in the literature either fall back to modeling complex non-linear optimization problems or tackle a narrow class of inputs. Hence, a fast and generic tile size selection model is desirable for it to be adopted into compiler infrastructures like those of GCC, LLVM, or MLIR. In this paper, we propose a new, fast and lightweight tile size selection model that considers temporal and spatial reuse along dimensions of a loop nest. For an n-dimensional loop nest, we determine the tile sizes by calculating the zeros of a polynomial in a single variable of degree at most n. Our tile size calculation model also accounts for vectorizability of the innermost dimension. We demonstrate the generality of our approach by selecting benchmarks from various domains: linear algebra kernels, digital signal processing (DSP) and image processing. We implement our tile size selection model in PolyMage (a domain-specific language and compiler for image processing pipelines) and Pluto (state-of-the-art polyhedral auto-parallelizer). Implementing the model in PolyMage allows us to extend it to DSP and linear algebra domains and also incorporate idiom recognition phases so that optimized vendor-specific library implementations could be utilized whenever profitable. Our experiments demonstrate a significant geomean performance gain of 2.2x over Matlab on benchmarks from the DSP domain. For PolyBench, we obtain a geomean speedup of 1.04x (maximum speedup of 1.3x) over Pluto.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86228203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A performance portability framework for Python
Nader Al Awar, Steven Zhu, G. Biros, Miloš Gligorić
Kokkos is a programming model for writing performance portable applications for all major high performance computing platforms. It provides abstractions for data management and common parallel operations, allowing developers to write portable high performance code with minimal knowledge of architecture-specific details. Kokkos is implemented as a heavily-templated C++ library. However, C++ is not ideal for rapid prototyping and quick algorithmic exploration. An increasing number of developers use Python for scientific computing, machine learning, and data analytics. In this paper, we present a new Python framework, dubbed PyKokkos, for writing performance portable applications entirely in Python. PyKokkos provides Kokkos-like abstractions that are easier to use and more concise than the C++ interface. We implemented PyKokkos by building a translator from a subset of Python to C++ Kokkos and bridging necessary function calls via automatically generated Python bindings. PyKokkos is also compatible with NumPy, a widely-used high performance Python library. By porting several existing Kokkos applications to PyKokkos, including ExaMiniMD (∼3k lines of code in C++), we show that the latter can achieve efficient execution with low performance overhead.
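A minimal PyKokkos-style kernel in the spirit of the paper's examples; the exact decorator, view, and type names below are assumptions rather than verified API:

```python
# Sketch of a PyKokkos-style AXPY kernel (assumed API names: pk.workunit,
# pk.View, pk.View1D, pk.double, pk.parallel_for).
import pykokkos as pk

@pk.workunit
def axpy(i: int, y: pk.View1D[pk.double], x: pk.View1D[pk.double], a: pk.double):
    y[i] = a * x[i] + y[i]   # executed once per index i, in parallel

def main():
    n = 1_000_000
    x = pk.View([n], pk.double)   # Kokkos-like portable array abstraction
    y = pk.View([n], pk.double)
    pk.parallel_for(n, axpy, y=y, x=x, a=2.0)

main()
```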
{"title":"A performance portability framework for Python","authors":"Nader Al Awar, Steven Zhu, G. Biros, Miloš Gligorić","doi":"10.1145/3447818.3460376","DOIUrl":"https://doi.org/10.1145/3447818.3460376","url":null,"abstract":"Kokkos is a programming model for writing performance portable applications for all major high performance computing platforms. It provides abstractions for data management and common parallel operations, allowing developers to write portable high performance code with minimal knowledge of architecture-specific details. Kokkos is implemented as a heavily-templated C++ library. However, C++ is not ideal for rapid prototyping and quick algorithmic exploration. An increasing number of developers use Python for scientific computing, machine learning, and data analytics. In this paper, we present a new Python framework, dubbed PyKokkos, for writing performance portable applications entirely in Python. PyKokkos provides Kokkos-like abstractions that are easier to use and more concise than the C++ interface. We implemented PyKokkos by building a translator from a subset of Python to C++ Kokkos and bridging necessary function calls via automatically generated Python bindings. PyKokkos is also compatible with NumPy, a widely-used high performance Python library. By porting several existing Kokkos applications to PyKokkos, including ExaMiniMD (∼3k lines of code in C++), we show that the latter can achieve efficient execution with low performance overhead.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83282241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Topology-aware optimizations for multi-GPU ptychographic image reconstruction
Xiaodong Yu, Tekin Bicer, R. Kettimuthu, Ian T Foster
Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal space experimental data to high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions use common communication topologies, such as P2P graph and ring, that are provided by MPI and NCCL libraries, to establish inter-GPU communications. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that are composed of both high- (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between application-level communication topology and physical interconnection can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations to address the aforementioned mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare its performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform the counterparts and achieve up to 5.13× and 1.63× communication and end-to-end application speedups, respectively.
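One way to make an application-level communication topology link-aware is sketched below as a maximum-bandwidth spanning tree over measured per-pair link speeds; this toy greedy construction is illustrative, not the paper's exact topology design:

```python
# Build a broadcast tree that prefers fast NVLink edges over slow PCIe
# edges: a greedy (Prim-style) maximum-bandwidth spanning tree.
import heapq

def broadcast_tree(links, root):
    """links: dict mapping frozenset({a, b}) -> link bandwidth in GB/s."""
    gpus = set().union(*links)
    visited, edges = {root}, []
    heap = [(-bw, root, b) for pair, bw in links.items() if root in pair
            for b in pair if b != root]
    heapq.heapify(heap)
    while heap and len(visited) < len(gpus):
        neg_bw, a, b = heapq.heappop(heap)
        if b in visited:
            continue
        visited.add(b)
        edges.append((a, b, -neg_bw))
        for pair, bw in links.items():        # expand frontier from b
            if b in pair:
                c = next(x for x in pair if x != b)
                if c not in visited:
                    heapq.heappush(heap, (-bw, b, c))
    return edges

# 4 GPUs: 0-1 and 2-3 have NVLink (50 GB/s); all other pairs use PCIe (12 GB/s).
links = {frozenset({0, 1}): 50, frozenset({2, 3}): 50,
         frozenset({0, 2}): 12, frozenset({0, 3}): 12,
         frozenset({1, 2}): 12, frozenset({1, 3}): 12}
print(broadcast_tree(links, root=0))   # both NVLinks plus a single PCIe hop
```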
{"title":"Topology-aware optimizations for multi-GPU ptychographic image reconstruction","authors":"Xiaodong Yu, Tekin Bicer, R. Kettimuthu, Ian T Foster","doi":"10.1145/3447818.3460380","DOIUrl":"https://doi.org/10.1145/3447818.3460380","url":null,"abstract":"Ptychography is an advanced high-resolution X-ray imaging technique that can generate extremely large datasets. Ptychographic reconstruction transforms reciprocal space experimental data to high-resolution 2D real-space images. GPUs have been used extensively to meet the computational requirements of the reconstruction. Generic multi-GPU reconstruction solutions use common communication topologies, such as P2P graph and ring, that are provided by MPI and NCCL libraries, to establish inter-GPU communications. However, these common topologies assume homogeneous physical links between GPUs, resulting in sub-optimal performance on heterogeneous configurations that are composed of both high- (e.g., NVLink) and low-speed (e.g., PCIe) interconnects. This mismatch between application-level communication topology and physical interconnection can cause data transfer congestion, inefficient memory access, and under-utilization of network resources. Here we present topology-aware designs and optimizations to address the aforementioned mismatch and boost end-to-end application performance. We introduce topology-aware data splitting, propose a novel communication topology, and incorporate asynchronous data movement and computation. We evaluate our design and optimizations using real and artificial datasets and compare its performance with that of the direct P2P and NCCL-based approaches. The results show that our optimizations always outperform the counterparts and achieve up to 5.13× and 1.63× communication and end-to-end application speedups, respectively.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"514 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80097226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Athena: high-performance sparse tensor contraction sequence on heterogeneous memory
Jiawen Liu, Dong Li, R. Gioiosa, Jiajia Li
Sparse tensor contraction (SpTC) sequences are widely employed in many fields, such as chemistry and physics. However, implementing these sequences efficiently faces multiple challenges, including redundant computation and memory operations, massive memory consumption, and inefficient utilization of hardware. To address these challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages the emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces a shared hash-table-based sparse accumulator to eliminate unnecessary input processing and data migration; it uses a novel data-semantics-guided dynamic migration solution to make the best use of Optane-based HM for high performance; and it co-runs execution phases with different characteristics to enable high hardware utilization. Evaluated on 12 datasets, Athena delivers a 327-7362× speedup over the state-of-the-art SpTC algorithm. With dynamic data placement guided by data semantics, Athena improves performance on Optane-based HM over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 1.58×, 1.82×, and 2.34×, respectively. Athena also demonstrates its effectiveness in quantum chemistry and physics scenarios.
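The hash-table sparse accumulator idea can be sketched for a single two-operand contraction; this toy version is illustrative only, whereas Athena shares one accumulator across threads and adds HM-aware data placement:

```python
# Contract COO tensors A[i, k] and B[k, j] over k into C[i, j], using a
# hash table keyed by output index as the sparse accumulator.
from collections import defaultdict

def sptc(A, B):
    b_by_k = defaultdict(list)
    for (k, j), v in B.items():
        b_by_k[k].append((j, v))            # index B's nonzeros by contraction mode
    acc = defaultdict(float)                # sparse accumulator: output index -> value
    for (i, k), va in A.items():
        for j, vb in b_by_k.get(k, ()):
            acc[(i, j)] += va * vb
    return dict(acc)

A = {(0, 1): 2.0, (1, 0): 3.0}
B = {(1, 2): 4.0, (0, 2): 5.0}
print(sptc(A, B))   # {(0, 2): 8.0, (1, 2): 15.0}
```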
{"title":"Athena: high-performance sparse tensor contraction sequence on heterogeneous memory","authors":"Jiawen Liu, Dong Li, R. Gioiosa, Jiajia Li","doi":"10.1145/3447818.3460355","DOIUrl":"https://doi.org/10.1145/3447818.3460355","url":null,"abstract":"Sparse tensor contraction sequence has been widely employed in many fields, such as chemistry and physics. However, how to efficiently implement the sequence faces multiple challenges, such as redundant computations and memory operations, massive memory consumption, and inefficient utilization of hardware. To address the above challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces shared hash table-represented sparse accumulator to eliminate unnecessary input processing and data migration; Athena uses a novel data-semantic guided dynamic migration solution to make the best use of the Optane-based HM for high performance; Athena also co-runs execution phases with different characteristics to enable high hardware utilization. Evaluating with 12 datasets, we show that Athena brings 327-7362× speedup over the state-of-the-art SpTC algorithm. With the dynamic data placement guided by data semantics, Athena brings performance improvement on Optane-based HM over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 1.58×, 1.82×, and 2.34× respectively. Athena also showcases its effectiveness in quantum chemistry and physics scenarios.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86711076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
PLANAR 平面
Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó
{"title":"PLANAR","authors":"Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó","doi":"10.1007/978-3-642-41714-6_162253","DOIUrl":"https://doi.org/10.1007/978-3-642-41714-6_162253","url":null,"abstract":"","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"70 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83272279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 59
Omegaflow
Yaoyang Zhou, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao
This paper investigates how to better track and deliver dependencies in dependency-based cores to exploit instruction-level parallelism (ILP) as much as possible. To this end, we first propose an analytical performance model for the state-of-the-art dependency-based core, Forwardflow, and identify two vital factors that bound its performance. We then propose Omegaflow, a dependency-based architecture adopting three new techniques that address the identified factors. Experimental results show that Omegaflow improves IPC by 24.6% compared to the state-of-the-art design, approaching the performance of the OoO architecture with an ideal scheduler (94.4%) without increasing the clock cycle, while consuming only 8.82% more energy than Forwardflow.
{"title":"Omegaflow","authors":"Yaoyang Zhou, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao","doi":"10.1145/3447818.3460367","DOIUrl":"https://doi.org/10.1145/3447818.3460367","url":null,"abstract":"This paper investigates how to better track and deliver dependency in dependency-based cores to exploit instruction-level parallelism (ILP) as much as possible. To this end, we first propose an analytical performance model for the state-of-art dependency-based core, Forwardflow, and figure out two vital factors affecting its upper bound of performance. Then we propose Omegaflow,a dependency-based architecture adopting three new techniques, which respond to the discovered factors. Experimental results show that Omegaflow improves IPC by 24.6% compared to the state-of-the-art design, approaching the performance of the OoO architecture with an ideal scheduler (94.4%) without increasing the clock cycle and consumes only 8.82% more energy than Forwardflow.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"117 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81779140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0