ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

Power and energy efficient routing for Mach-Zehnder interferometer based photonic switches

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460363

Markos Kynigos, J. A. Pascual, J. Navaridas, J. Goodacre, M. Luján

Silicon Photonic top-of-rack (ToR) switches are highly desirable for the datacenter (DC) and high-performance computing (HPC) domains for their potential for high bandwidth and energy efficiency. Recently, photonic Beneš switching fabrics based on Mach-Zehnder Interferometers (MZIs) have been proposed as a promising candidate for the internals of high-performance switches. However, state-of-the-art routing algorithms that control these switching fabrics are either computationally complex or unable to provide non-blocking, energy-efficient routing permutations. To address this, we propose for the first time a combination of energy-efficient routing algorithms and time-division multiplexing (TDM). We evaluate this approach by conducting a simulation-based performance evaluation of a 16x16 Beneš fabric, deployed as a ToR switch, when handling a set of 8 representative workloads from the DC and HPC domains. Our results show that state-of-the-art approaches (circuit-switched energy-efficient routing algorithms) introduce up to 23% contention in the switching fabric for some workloads, thereby increasing communication time. We show that augmenting the algorithms with TDM can ameliorate switch fabric contention by segmenting communication data and gracefully interleaving the segments, thus reducing communication time by up to 20% in the best case. We also discuss the impact of the TDM segment size, finding that although a 10KB segment size is the most beneficial in reducing communication time, a 100KB segment size offers similar performance while requiring a less stringent path-computation time window. Finally, we assess the impact of TDM on path-dependent insertion loss and switching energy consumption, finding it to be minimal in all cases.
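The segmentation and interleaving idea in the abstract above can be illustrated with a minimal sketch. It assumes a plain round-robin order between contending flows and treats 10KB as 10,000 bytes; the abstract does not specify the interleaving policy, and the function names `segment`, `interleave`, and `benes_stages` are hypothetical. The stage count uses the standard property that an N x N Beneš network has 2*log2(N) - 1 stages of N/2 two-by-two switches.

```python
from itertools import zip_longest


def segment(message_bytes, segment_bytes):
    """Split a message size into TDM segment sizes (the last one may be shorter)."""
    full, rest = divmod(message_bytes, segment_bytes)
    return [segment_bytes] * full + ([rest] if rest else [])


def interleave(flows, segment_bytes):
    """Round-robin schedule (an assumed policy): one segment per flow per round."""
    queues = [[(name, s) for s in segment(size, segment_bytes)]
              for name, size in flows.items()]
    schedule = []
    for round_ in zip_longest(*queues):
        # Flows that have run out of segments contribute None and are skipped.
        schedule.extend(item for item in round_ if item is not None)
    return schedule


def benes_stages(ports):
    """Stages in an N x N Benes network, N a power of two: 2*log2(N) - 1."""
    return 2 * (ports.bit_length() - 1) - 1


if __name__ == "__main__":
    print(interleave({"a": 20_000, "b": 10_000}, 10_000))
    print(benes_stages(16), "stages of", 16 // 2, "switches")
```

With two flows contending for the same path, the shorter flow finishes after the first round instead of waiting for the longer one to complete, which is the contention-relief effect the abstract attributes to TDM.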

Citations: 1

Optimizing large-scale plasma simulations on persistent memory-based heterogeneous memory with effective data placement across memory hierarchy

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460356

Jie Ren, Jiaolin Luo, I. Peng, Kai Wu, Dong Li

Particle simulations of plasma are important for understanding plasma dynamics in space weather and fusion devices. However, production simulations that use billions and even trillions of computational particles require high memory capacity. In this work, we explore the latest persistent memory (PM) hardware to enable large-scale plasma simulations at unprecedented scales on a single machine. We use WarpX, an advanced plasma simulation code which is mission-critical and targets future exascale systems. We analyze the performance of WarpX on PM-based heterogeneous memory systems and propose to make the best use of the memory hierarchy to avoid the impact of the inferior performance of PM. We introduce a combination of static and dynamic data placement, and a processor-cache prefetch mechanism for performance optimization. We develop a performance model to enable efficient data migration between PM and DRAM in the background, without reducing the bandwidth and parallelism available to the application threads. We also build an analytical model to decide when to prefetch for the best use of caches. Our design achieves a 66.4% performance improvement over the PM-only baseline and outperforms DRAM-cached, NUMA first-touch, and a state-of-the-art software solution by 38.8%, 45.1% and 83.3%, respectively.

Citations: 8

A systematic approach to improving data locality across Fourier transforms and linear algebra operations

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460354

Doru-Thom Popovici, A. Canning, Zhengji Zhao, Lin-wang Wang, J. Shalf

The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane wave based Density Functional Theory approach for electronic structure calculations use highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization) and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations for each standalone mathematical kernel, the partitioning of those calls into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory-bound operations. In this work we show that, by expressing these kernels as an operation on high dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense and sparse linear algebra can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that this streaming/dataflow approach offers a 2x speedup on GPUs and an 8x/12x speedup on CPUs compared to a baseline code that uses vendor-optimized libraries. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.

Citations: 3

An optimized tensor completion library for multiple GPUs

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460692

Ming Dun, Yunchun Li, Hailong Yang, Qingxiao Sun, Zhongzhi Luan, D. Qian

Tensor computations are gaining wide adoption in big data analysis and artificial intelligence. Among them, tensor completion is used to predict the missing or unobserved values in tensors. The decomposition-based tensor completion algorithms have attracted significant research attention since they exhibit better parallelization and scalability. However, existing optimization techniques for tensor completion cannot sustain the increasing demand for applying tensor completion on ever larger tensor data. To address the above limitations, we develop the first tensor completion library cuTC on multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms: alternating least squares (ALS), stochastic gradient descent (SGD) and coordinate descent (CCD+). We propose a novel TB-COO format that leverages warp shuffle and shared memory on GPU to enable efficient reduction. In addition, we adopt an auto-tuning method to determine the optimal parameters for better convergence and performance. We compare cuTC with state-of-the-art tensor completion libraries on real-world datasets, and the results show cuTC achieves significant speedup with similar or even better accuracy.

Citations: 1

Inter-loop optimization in RAJA using loop chains

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3461665

Brandon Neth, T. Scogland, B. Supinski, M. Strout

Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing and blocking for data locality for individual loops. By focusing on each loop separately, these approaches fail to leverage sources of data locality possible due to inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations like loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic execution capabilities, we can collect and cache data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our results demonstrate 85-98% of the performance improvements of hand-optimized kernels with dramatically fewer code changes.
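Loop fusion, one of the transformations named in the abstract above, can be shown with a minimal sketch. This is plain Python for illustration only and does not use RAJA or the RAJALC interface; the function names `unfused` and `fused` are hypothetical.

```python
def unfused(a):
    """Two separate loops: the intermediate array b is written, then re-read."""
    n = len(a)
    b = [0.0] * n
    for i in range(n):          # loop 1: producer
        b[i] = 2.0 * a[i]
    c = [0.0] * n
    for i in range(n):          # loop 2: consumer
        c[i] = b[i] + 1.0
    return c


def fused(a):
    """Fused version: each intermediate value is consumed right after it is produced."""
    n = len(a)
    c = [0.0] * n
    for i in range(n):
        b_i = 2.0 * a[i]        # stays in a local instead of a full array
        c[i] = b_i + 1.0
    return c


if __name__ == "__main__":
    print(unfused([1.0, 2.0, 3.0]), fused([1.0, 2.0, 3.0]))
```

Fusion is legal here because iteration i of the second loop reads only b[i]. If the consumer read neighbouring elements such as b[i-1] or b[i+1], a direct fusion would break the dependence, which is the kind of case where a transformation like overlapped tiling and a check of the data access pattern become necessary.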

Citations: 1

Distributed merge forest: a new fast and scalable approach for topological analysis at scale

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460358

Xuan Huang, Pavol Klacansky, Steve Petruzza, A. Gyulassy, P. Bremer, Valerio Pascucci

Topological analysis is used in several domains to identify and characterize important features in scientific data, and is now one of the established classes of techniques of proven practical use in scientific computing. The growth in parallelism and problem size tackled by modern simulations poses a particular challenge for these approaches. Fundamentally, the global encoding of topological features necessitates interprocess communication that limits their scaling. In this paper, we extend a new topological paradigm to the case of distributed computing, where the construction of a global merge tree is replaced by a distributed data structure, the merge forest, trading slower individual queries on the structure for faster end-to-end performance and scaling. Empirically, the queries that are most negatively affected also tend to have limited practical use. Our experimental results demonstrate the scalability of both the merge forest construction and the parallel queries needed in scientific workflows, and contrast this scalability with the two established alternatives that construct variations of a global tree.
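For readers unfamiliar with merge trees, the following is a minimal sequential sketch of the global structure that the abstract above says the merge forest replaces. It assumes a 1D scalar field and a descending sweep with union-find; it is not the distributed algorithm of the paper, and the name `merge_events` is hypothetical.

```python
def merge_events(values):
    """Sweep samples from highest to lowest value and record merge-tree events.

    A 'birth' is a local maximum that starts a new component; a 'merge' is a
    saddle where two existing components join.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    events = []
    for i in sorted(range(len(values)), key=lambda k: -values[k]):
        parent[i] = i
        # Components of the already-processed (higher) neighbours.
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        if not roots:
            events.append(("birth", i))
        elif len(roots) == 2:
            events.append(("merge", i))
        for r in roots:
            parent[r] = i       # the current sample becomes the representative
    return events


if __name__ == "__main__":
    print(merge_events([1, 5, 2, 4, 0, 3]))
```

The sweep needs a global ordering of all samples, which hints at why a single global tree requires communication in a distributed setting; a forest of per-block trees, as described in the abstract, avoids building that global structure up front.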

Citations: 2

Proceedings of the ACM International Conference on Supercomputing

Pub Date : 2021-06-03 DOI: 10.1145/3447818

Citations: 1

Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3461472

Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, K. Yelick, A. Buluç

Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes by focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affects computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.
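The local SpMM operation referred to in the abstract above can be sketched as follows. This is a single-node, pure-Python illustration that assumes the sparse matrix is stored in compressed sparse row (CSR) form; it shows neither the distributed algorithms nor the GPU kernels evaluated in the paper, and the name `spmm_csr` is hypothetical.

```python
def spmm_csr(indptr, indices, data, dense):
    """Compute C = A @ B, with A sparse in CSR form and B a dense list of rows.

    indptr[i]:indptr[i+1] delimits the nonzeros of row i of A;
    indices holds their column ids and data their values.
    """
    n_rows = len(indptr) - 1
    n_cols = len(dense[0]) if dense else 0
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_c = out[i]
        for p in range(indptr[i], indptr[i + 1]):
            a_ij = data[p]
            row_b = dense[indices[p]]   # row of B selected by the nonzero's column
            for k in range(n_cols):
                row_c[k] += a_ij * row_b[k]
    return out


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 3]]
    indptr, indices, data = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
    B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(spmm_csr(indptr, indices, data, B))
```

Each nonzero of A touches a full row of B, so when B is tall and skinny the inner loop is short and the cost is dominated by irregular row accesses, which is consistent with the abstract's point that local SpMM performance matters alongside communication cost.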

Citations: 9

ALTO

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3461703

A. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi

The analysis of high-dimensional sparse data is becoming increasingly popular in many important domains. However, real-world sparse tensors are challenging to process due to their irregular shapes and data distributions. We propose the Adaptive Linearized Tensor Order (ALTO) format, a novel mode-agnostic (general) representation that keeps neighboring nonzero elements in the multi-dimensional space close to each other in memory. To generate the indexing metadata, ALTO uses an adaptive bit encoding scheme that trades off index computations for lower memory usage and more effective use of memory bandwidth. Moreover, by decoupling its sparse representation from the irregular spatial distribution of nonzero elements, ALTO eliminates the workload imbalance and greatly reduces the synchronization overhead of tensor computations. As a result, the parallel performance of ALTO-based tensor operations becomes a function of their inherent data reuse. On a gamut of tensor datasets, ALTO outperforms an oracle that selects the best state-of-the-art format for each dataset, when used in key tensor decomposition operations. Specifically, ALTO achieves a geometric mean speedup of 8x over the best mode-agnostic (coordinate and hierarchical coordinate) formats, while delivering a geometric mean compression ratio of 4.x relative to the best mode-specific (compressed sparse fiber) formats.
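The idea of a linearized, mode-agnostic index can be illustrated with a simplified sketch. It assumes a round-robin bit interleaving in which each mode contributes only as many bits as its length requires; the exact bit ordering used by ALTO may differ, and the names `bit_layout`, `linearize`, and `delinearize` are hypothetical.

```python
def bits_needed(dim):
    """Bits required to index a mode of the given length (at least one)."""
    return max(1, (dim - 1).bit_length())


def bit_layout(dims):
    """(mode, bit) pairs listed from the least significant position upward.

    Modes are visited round-robin; a short mode drops out once its bits are
    used up, so no position in the linear index is wasted on padding.
    """
    widths = [bits_needed(d) for d in dims]
    layout = []
    for bit in range(max(widths)):
        for mode, width in enumerate(widths):
            if bit < width:
                layout.append((mode, bit))
    return layout


def linearize(coords, dims):
    """Pack multi-dimensional coordinates into a single integer key."""
    key = 0
    for pos, (mode, bit) in enumerate(bit_layout(dims)):
        key |= ((coords[mode] >> bit) & 1) << pos
    return key


def delinearize(key, dims):
    """Recover the coordinates from a key produced by linearize."""
    coords = [0] * len(dims)
    for pos, (mode, bit) in enumerate(bit_layout(dims)):
        coords[mode] |= ((key >> pos) & 1) << bit
    return tuple(coords)


if __name__ == "__main__":
    dims = (4, 8, 2)
    key = linearize((3, 5, 1), dims)
    print(key, delinearize(key, dims))
```

For a 4 x 8 x 2 tensor the key needs 2 + 3 + 1 = 6 bits in total, compared with 9 bits for a uniform interleaving that pads every mode to 3 bits. Sorting nonzeros by such a key keeps elements that are close in every mode close in memory, which is the locality property the abstract describes.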


Citations: 14

SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing

Pub Date : 2021-06-03 DOI: 10.1145/3447818.3460375

Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher

Deep Neural Network (DNN) inference efficiency is a key concern across the myriad domains now relying on Deep Learning. A recent promising direction for speeding up inference is to exploit weight repetition. The key observation is that, because DNN quantization schemes reduce storage requirements by reducing the number of bits needed to represent each weight, the same weight is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factorize and memoize common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet significant challenges remain. For instance, weight repetition introduces significant irregularity in the inference operation and hence (up to this point) has required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques to make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs with weight-dependent structure. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized library oneDNN and the prior-art weight repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves speedups of 1.09x to 2.05x relative to oneDNN and 1.04x to 1.51x relative to AGR, while simultaneously compressing the DNN model by 8.7x to 15.4x.
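The basic factorization that weight repetition enables can be sketched as follows. This shows only the underlying idea of summing the inputs that share a weight and multiplying once per distinct weight. It is not SumMerge's data-flow graph algorithm, and the function names (`build_groups`, `dot_factorized`, `dot_dense`) are hypothetical. The grouping depends only on the trained weights, so it can be computed offline, while the per-group sums are computed online for each input.

```python
from collections import defaultdict


def build_groups(weights):
    """Offline step: map each distinct nonzero weight to the input positions using it."""
    groups = defaultdict(list)
    for i, w in enumerate(weights):
        if w != 0:  # zero weights contribute nothing and are skipped
            groups[w].append(i)
    return dict(groups)


def dot_factorized(groups, x):
    """Online step: one input sum and one multiply per distinct weight."""
    return sum(w * sum(x[i] for i in idxs) for w, idxs in groups.items())


def dot_dense(weights, x):
    """Reference dot product: one multiply per weight."""
    return sum(w * xi for w, xi in zip(weights, x))


if __name__ == "__main__":
    weights = [2, -1, 2, 0, -1, 2]  # hypothetical quantized filter
    x = [1, 2, 3, 4, 5, 6]          # hypothetical input activations
    groups = build_groups(weights)
    print(groups, dot_factorized(groups, x), dot_dense(weights, x))
```

In this toy filter the dense product needs six multiplications, while the factorized form needs two, one per distinct nonzero weight. The irregular index lists in `groups` are a small instance of the irregularity the abstract describes.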



Citations: 5