首页 > 最新文献

ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing最新文献

英文 中文
Power and energy efficient routing for Mach-Zehnder interferometer based photonic switches 基于Mach-Zehnder干涉仪的光子开关的功率和能量高效路由
Markos Kynigos, J. A. Pascual, J. Navaridas, J. Goodacre, M. Luján
Silicon Photonic top-of-rack (ToR) switches are highly desirable for the datacenter (DC) and high-performance computing (HPC) domains for their potential high-bandwidth and energy efficiency. Recently, photonic Beneš switching fabrics based on Mach-Zehnder Interferometers (MZIs) have been proposed as a promising candidate for the internals of high-performance switches. However, state-of-the-art routing algorithms that control these switching fabrics are either computationally complex or unable to provide non-blocking, energy efficient routing permutations.To address this, we propose for the first time a combination of energy efficient routing algorithms and time-division multiplexing (TDM). We evaluate this approach by conducting a simulation-based performance evaluation of a 16x16 Beneš fabric, deployed as a ToR switch, when handling a set of 8 representative workloads from the DC and HPC domains. Our results show that state-of-the-art approaches (circuit switched energy efficient routing algorithms) introduce up to 23% contention in the switching fabric for some workloads, thereby increasing communication time. We show that augmenting the algorithms with TDM can ameliorate switch fabric contention by segmenting communication data and gracefully interleaving the segments, thus reducing communication time by up to 20% in the best case. We also discuss the impact of the TDM segment size, finding that although a 10KB segment size is the most beneficial in reducing communication time, a 100KB segment size offers similar performance while requiring a less stringent path-computation time window. Finally, we assess the impact of TDM on path-dependent insertion loss and switching energy consumption, finding it to be minimal in all cases.
硅光子架顶式(ToR)交换机因其潜在的高带宽和高能效,在数据中心(DC)和高性能计算(HPC)领域非常受欢迎。近年来,基于Mach-Zehnder干涉仪(MZIs)的光子benei开关织物被认为是高性能开关内部的一个有前途的候选材料。然而,控制这些交换结构的最先进的路由算法要么计算复杂,要么无法提供无阻塞、节能的路由排列。为了解决这个问题,我们首次提出了节能路由算法和时分多路复用(TDM)的组合。在处理来自DC和HPC域的8个代表性工作负载时,我们通过对作为ToR交换机部署的16x16 Beneš fabric进行基于模拟的性能评估来评估这种方法。我们的研究结果表明,最先进的方法(电路交换节能路由算法)在某些工作负载的交换结构中引入了高达23%的争用,从而增加了通信时间。我们表明,通过对通信数据进行分段和优雅地交错,在TDM中增强算法可以改善交换结构争用,从而在最佳情况下将通信时间减少多达20%。我们还讨论了TDM段大小的影响,发现尽管10KB的段大小对减少通信时间最有利,但100KB的段大小提供了类似的性能,同时需要不那么严格的路径计算时间窗口。最后,我们评估时分复用对路径相关的插入损耗和开关能耗的影响,发现它在所有情况下都是最小的。
{"title":"Power and energy efficient routing for Mach-Zehnder interferometer based photonic switches","authors":"Markos Kynigos, J. A. Pascual, J. Navaridas, J. Goodacre, M. Luján","doi":"10.1145/3447818.3460363","DOIUrl":"https://doi.org/10.1145/3447818.3460363","url":null,"abstract":"Silicon Photonic top-of-rack (ToR) switches are highly desirable for the datacenter (DC) and high-performance computing (HPC) domains for their potential high-bandwidth and energy efficiency. Recently, photonic Beneš switching fabrics based on Mach-Zehnder Interferometers (MZIs) have been proposed as a promising candidate for the internals of high-performance switches. However, state-of-the-art routing algorithms that control these switching fabrics are either computationally complex or unable to provide non-blocking, energy efficient routing permutations.To address this, we propose for the first time a combination of energy efficient routing algorithms and time-division multiplexing (TDM). We evaluate this approach by conducting a simulation-based performance evaluation of a 16x16 Beneš fabric, deployed as a ToR switch, when handling a set of 8 representative workloads from the DC and HPC domains. Our results show that state-of-the-art approaches (circuit switched energy efficient routing algorithms) introduce up to 23% contention in the switching fabric for some workloads, thereby increasing communication time. We show that augmenting the algorithms with TDM can ameliorate switch fabric contention by segmenting communication data and gracefully interleaving the segments, thus reducing communication time by up to 20% in the best case. We also discuss the impact of the TDM segment size, finding that although a 10KB segment size is the most beneficial in reducing communication time, a 100KB segment size offers similar performance while requiring a less stringent path-computation time window. Finally, we assess the impact of TDM on path-dependent insertion loss and switching energy consumption, finding it to be minimal in all cases.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"82 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72871757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Optimizing large-scale plasma simulations on persistent memory-based heterogeneous memory with effective data placement across memory hierarchy 在基于持久内存的异构内存上优化大规模等离子体模拟,实现跨内存层次的有效数据放置
Jie Ren, Jiaolin Luo, I. Peng, Kai Wu, Dong Li
Particle simulations of plasma are important for understanding plasma dynamics in space weather and fusion devices. However, production simulations that use billions and even trillions of computational particles require high memory capacity. In this work, we explore the latest persistent memory (PM) hardware to enable large-scale plasma simulations at unprecedented scales on a single machine. We use WarpX, an advanced plasma simulation code which is mission-critical and targets future exascale systems. We analyze the performance of WarpX on PM-based heterogeneous memory systems and propose to make the best use of memory hierarchy to avoid the impact of inferior performance of PM. We introduce a combination of static and dynamic data placement, and processor-cache prefetch mechanism for performance optimization. We develop a performance model to enable efficient data migration between PM and DRAM in the background, without reducing available bandwidth and parallelism to the application threads. We also build an analytical model to decide when to prefetch for the best use of caches. Our design achieves 66.4% performance improvement over the PM-only baseline and outperforms DRAM-cached, NUMA first-touch, and a state-of-the-art software solution by 38.8%, 45.1% and 83.3%, respectively.
等离子体的粒子模拟对于理解空间天气和聚变装置中的等离子体动力学具有重要意义。然而,使用数十亿甚至数万亿计算粒子的生产模拟需要高内存容量。在这项工作中,我们探索了最新的持久内存(PM)硬件,以便在一台机器上实现前所未有的大规模等离子体模拟。我们使用WarpX,这是一种先进的等离子体模拟代码,它是关键任务,目标是未来的百亿亿级系统。我们分析了WarpX在基于PM的异构存储系统上的性能,并提出了最好地利用内存层次结构来避免PM性能低下的影响。我们引入了静态和动态数据放置的组合,以及用于性能优化的处理器缓存预取机制。我们开发了一个性能模型,在后台实现PM和DRAM之间的有效数据迁移,而不会减少可用带宽和应用程序线程的并行性。我们还建立了一个分析模型来决定何时预取以最佳地利用缓存。我们的设计在纯pm基准上实现了66.4%的性能改进,并且分别比dram缓存、NUMA first-touch和最先进的软件解决方案高出38.8%、45.1%和83.3%。
{"title":"Optimizing large-scale plasma simulations on persistent memory-based heterogeneous memory with effective data placement across memory hierarchy","authors":"Jie Ren, Jiaolin Luo, I. Peng, Kai Wu, Dong Li","doi":"10.1145/3447818.3460356","DOIUrl":"https://doi.org/10.1145/3447818.3460356","url":null,"abstract":"Particle simulations of plasma are important for understanding plasma dynamics in space weather and fusion devices. However, production simulations that use billions and even trillions of computational particles require high memory capacity. In this work, we explore the latest persistent memory (PM) hardware to enable large-scale plasma simulations at unprecedented scales on a single machine. We use WarpX, an advanced plasma simulation code which is mission-critical and targets future exascale systems. We analyze the performance of WarpX on PM-based heterogeneous memory systems and propose to make the best use of memory hierarchy to avoid the impact of inferior performance of PM. We introduce a combination of static and dynamic data placement, and processor-cache prefetch mechanism for performance optimization. We develop a performance model to enable efficient data migration between PM and DRAM in the background, without reducing available bandwidth and parallelism to the application threads. We also build an analytical model to decide when to prefetch for the best use of caches. Our design achieves 66.4% performance improvement over the PM-only baseline and outperforms DRAM-cached, NUMA first-touch, and a state-of-the-art software solution by 38.8%, 45.1% and 83.3%, respectively.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88079377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
A systematic approach to improving data locality across Fourier transforms and linear algebra operations 在傅里叶变换和线性代数操作中改进数据局部性的系统方法
Doru-Thom Popovici, A. Canning, Zhengji Zhao, Lin-wang Wang, J. Shalf
The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane wave based Density Functional Theory approach for electronic structure calculations uses highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization) and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations for each standalone mathematical kernel, the partitioning of those calls into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory bound operations. In this work we show that, by expressing these kernels as an operation on high dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense and sparse linear algebra, can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that compared to conventional implementations, this streaming/dataflow approach offers 2x speedup on GPUs and 8x/12x speedup on CPUs compared to a baseline code that uses vendor-optimized libraries. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.
大多数科学应用程序的性能依赖于高效的数学库。例如,用于电子结构计算的基于平面波的密度泛函理论方法等科学应用使用高度优化的傅立叶变换、密集线性代数(正交化)和稀疏线性代数(实空间中的非局部投影)库。尽管供应商调优的库为每个独立的数学内核提供了高效的实现,但是将这些调用划分为顺序调用的内核抑制了跨内核优化,而跨内存绑定操作可以提高数据局部性。在这项工作中,我们表明,通过将这些核表示为对高维张量的操作,跨FFT、密集和稀疏线性代数的跨核数据流优化可以很容易地暴露和利用。我们提出了一种将傅里叶变换与线性代数计算相结合的系统方法,提高了数据的局域性,减少了数据向主存的移动。我们展示了与传统实现相比,与使用供应商优化库的基准代码相比,这种流/数据流方法在gpu上提供了2倍的加速,在cpu上提供了8倍/12倍的加速。虽然我们使用密度泛函理论来证明我们方法的价值,但我们的方法广泛适用于使用傅里叶变换和线性代数运算作为构建块的其他应用。
{"title":"A systematic approach to improving data locality across Fourier transforms and linear algebra operations","authors":"Doru-Thom Popovici, A. Canning, Zhengji Zhao, Lin-wang Wang, J. Shalf","doi":"10.1145/3447818.3460354","DOIUrl":"https://doi.org/10.1145/3447818.3460354","url":null,"abstract":"The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane wave based Density Functional Theory approach for electronic structure calculations uses highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization) and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations for each standalone mathematical kernel, the partitioning of those calls into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory bound operations. In this work we show that, by expressing these kernels as an operation on high dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense and sparse linear algebra, can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that compared to conventional implementations, this streaming/dataflow approach offers 2x speedup on GPUs and 8x/12x speedup on CPUs compared to a baseline code that uses vendor-optimized libraries. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83388020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
An optimized tensor completion library for multiple GPUs 一个优化的张量补全库为多个gpu
Ming Dun, Yunchun Li, Hailong Yang, Qingxiao Sun, Zhongzhi Luan, D. Qian
Tensor computations are gaining wide adoption in big data analysis and artificial intelligence. Among them, tensor completion is used to predict the missing or unobserved value in tensors. The decomposition-based tensor completion algorithms have attracted significant research attention since they exhibit better parallelization and scalability. However, existing optimization techniques for tensor completion cannot sustain the increasing demand for applying tensor completion on ever larger tensor data. To address the above limitations, we develop the first tensor completion library cuTC on multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms such as alternating least squares (ALS), stochastic gradient descent (SGD) and coordinate descent (CCD+). We propose a novel TB-COO format that leverages warp shuffle and shared memory on GPU to enable efficient reduction. In addition, we adopt the auto-tuning method to determine the optimal parameters for better convergence and performance. We compare cuTC with state-of-the-art tensor completion libraries on real-world datasets, and the results show cuTC achieves significant speedup with similar or even better accuracy.
张量计算在大数据分析和人工智能中得到了广泛的应用。其中,张量补全用于预测张量中的缺失值或未观测值。基于分解的张量补全算法由于具有更好的并行性和可扩展性而引起了广泛的研究关注。然而,现有的张量补全优化技术无法满足在越来越大的张量数据上应用张量补全的需求。为了解决上述限制,我们在多个图形处理单元(gpu)上开发了第一个张量补全库cuTC,其中包括三种广泛使用的优化算法,如交替最小二乘(ALS),随机梯度下降(SGD)和坐标下降(CCD+)。我们提出了一种新的TB-COO格式,它利用GPU上的warp shuffle和共享内存来实现有效的缩减。此外,我们采用自调谐方法来确定最优参数,以获得更好的收敛性和性能。我们将cuTC与现实世界数据集上最先进的张量补全库进行了比较,结果表明cuTC在相似甚至更好的精度下实现了显著的加速。
{"title":"An optimized tensor completion library for multiple GPUs","authors":"Ming Dun, Yunchun Li, Hailong Yang, Qingxiao Sun, Zhongzhi Luan, D. Qian","doi":"10.1145/3447818.3460692","DOIUrl":"https://doi.org/10.1145/3447818.3460692","url":null,"abstract":"Tensor computations are gaining wide adoption in big data analysis and artificial intelligence. Among them, tensor completion is used to predict the missing or unobserved value in tensors. The decomposition-based tensor completion algorithms have attracted significant research attention since they exhibit better parallelization and scalability. However, existing optimization techniques for tensor completion cannot sustain the increasing demand for applying tensor completion on ever larger tensor data. To address the above limitations, we develop the first tensor completion library cuTC on multiple Graphics Processing Units (GPUs) with three widely used optimization algorithms such as alternating least squares (ALS), stochastic gradient descent (SGD) and coordinate descent (CCD+). We propose a novel TB-COO format that leverages warp shuffle and shared memory on GPU to enable efficient reduction. In addition, we adopt the auto-tuning method to determine the optimal parameters for better convergence and performance. We compare cuTC with state-of-the-art tensor completion libraries on real-world datasets, and the results show cuTC achieves significant speedup with similar or even better accuracy.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75905281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Inter-loop optimization in RAJA using loop chains 基于循环链的RAJA循环间优化
Brandon Neth, T. Scogland, B. Supinski, M. Strout
Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing and blocking for data locality for individual loops. By focusing on each loop separately, these approaches fail to leverage sources of data locality possible due to inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations like loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic execution capabilities, we can collect and cache data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our results demonstrate 85-98% of the performance improvements of hand-optimized kernels with dramatically fewer code changes.
典型的并行化方法,如OpenMP和CUDA,为单个循环的数据局部性提供了并行化和阻塞结构。由于分别关注每个循环,这些方法无法利用由于循环间数据重用而可能产生的数据源局部性。循环链抽象为推理和应用循环间优化提供了一个框架。在这项工作中,我们将循环链抽象合并到RAJA中,RAJA是高性能计算应用程序的性能可移植性层。使用环链扩展的RAJA或rajarc,开发人员可以让RAJA库应用循环转换,如循环融合和重叠平铺,同时保持程序的原始结构。通过引入目标符号执行功能,我们可以收集和缓存验证循环转换所需的数据访问信息。我们评估了扩展的性能改进和重构成本。总的来说,我们的结果表明,手工优化内核的性能提高了85- 98%,而代码更改却少得多。
{"title":"Inter-loop optimization in RAJA using loop chains","authors":"Brandon Neth, T. Scogland, B. Supinski, M. Strout","doi":"10.1145/3447818.3461665","DOIUrl":"https://doi.org/10.1145/3447818.3461665","url":null,"abstract":"Typical parallelization approaches such as OpenMP and CUDA provide constructs for parallelizing and blocking for data locality for individual loops. By focusing on each loop separately, these approaches fail to leverage sources of data locality possible due to inter-loop data reuse. The loop chain abstraction provides a framework for reasoning about and applying inter-loop optimizations. In this work, we incorporate the loop chain abstraction into RAJA, a performance portability layer for high-performance computing applications. Using the loop-chain-extended RAJA, or RAJALC, developers can have the RAJA library apply loop transformations like loop fusion and overlapped tiling while maintaining the original structure of their programs. By introducing targeted symbolic execution capabilities, we can collect and cache data access information required to verify loop transformations. We evaluate the performance improvement and refactoring costs of our extension. Overall, our results demonstrate 85-98% of the performance improvements of hand-optimized kernels with dramatically fewer code changes.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80901068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Distributed merge forest: a new fast and scalable approach for topological analysis at scale 分布式合并森林:用于大规模拓扑分析的一种新的快速和可扩展的方法
Xuan Huang, Pavol Klacansky, Steve Petruzza, A. Gyulassy, P. Bremer, Valerio Pascucci
Topological analysis is used in several domains to identify and characterize important features in scientific data, and is now one of the established classes of techniques of proven practical use in scientific computing. The growth in parallelism and problem size tackled by modern simulations poses a particular challenge for these approaches. Fundamentally, the global encoding of topological features necessitates interprocess communication that limits their scaling. In this paper, we extend a new topological paradigm to the case of distributed computing, where the construction of a global merge tree is replaced by a distributed data structure, the merge forest, trading slower individual queries on the structure for faster end-to-end performance and scaling. Empirically, the queries that are most negatively affected also tend to have limited practical use. Our experimental results demonstrate the scalability of both the merge forest construction and the parallel queries needed in scientific workflows, and contrast this scalability with the two established alternatives that construct variations of a global tree.
拓扑分析在许多领域被用于识别和描述科学数据中的重要特征,并且现在是在科学计算中被证明实际使用的已建立的技术类别之一。现代模拟处理的并行性和问题规模的增长对这些方法提出了特别的挑战。从根本上说,拓扑特征的全局编码需要进程间通信,这限制了它们的扩展。在本文中,我们将一种新的拓扑范式扩展到分布式计算的情况下,其中全局合并树的构造被分布式数据结构(合并森林)所取代,在结构上交换较慢的单个查询,以获得更快的端到端性能和可扩展性。根据经验,最受负面影响的查询也往往具有有限的实际用途。我们的实验结果证明了合并森林构建和科学工作流中所需的并行查询的可扩展性,并将这种可扩展性与构建全局树变体的两种已建立的替代方案进行了对比。
{"title":"Distributed merge forest: a new fast and scalable approach for topological analysis at scale","authors":"Xuan Huang, Pavol Klacansky, Steve Petruzza, A. Gyulassy, P. Bremer, Valerio Pascucci","doi":"10.1145/3447818.3460358","DOIUrl":"https://doi.org/10.1145/3447818.3460358","url":null,"abstract":"Topological analysis is used in several domains to identify and characterize important features in scientific data, and is now one of the established classes of techniques of proven practical use in scientific computing. The growth in parallelism and problem size tackled by modern simulations poses a particular challenge for these approaches. Fundamentally, the global encoding of topological features necessitates interprocess communication that limits their scaling. In this paper, we extend a new topological paradigm to the case of distributed computing, where the construction of a global merge tree is replaced by a distributed data structure, the merge forest, trading slower individual queries on the structure for faster end-to-end performance and scaling. Empirically, the queries that are most negatively affected also tend to have limited practical use. Our experimental results demonstrate the scalability of both the merge forest construction and the parallel queries needed in scientific workflows, and contrast this scalability with the two established alternatives that construct variations of a global tree.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87150732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Proceedings of the ACM International Conference on Supercomputing ACM超级计算国际会议论文集
{"title":"Proceedings of the ACM International Conference on Supercomputing","authors":"","doi":"10.1145/3447818","DOIUrl":"https://doi.org/10.1145/3447818","url":null,"abstract":"","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91040679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
NPBench
A. Ziogas, Tal Ben-Nun, Timo Schneider, T. Hoefler
Python, already one of the most popular languages for scientific computing, has made significant inroads in High Performance Computing (HPC). At the center of Python's ecosystem is NumPy, an efficient implementation of the multi-dimensional array (tensor) structure, together with basic arithmetic and linear algebra. Compared to traditional HPC languages, the relatively low performance of Python and NumPy has spawned significant research in compilers and frameworks that decouple Python's compact representation from the underlying implementation. However, it is challenging to compare language compatibility and performance among different frameworks and architectures without a standard set of benchmarks and metrics. To that end, we introduce NPBench, a set of NumPy code samples representing a large variety of HPC applications. We use NPBench to test popular NumPy-accelerating compilers and frameworks on a variety of metrics. NPBench will guide both end-users and framework developers focusing on performance and will drive further use of Python in the high-performance scientific domains.
{"title":"NPBench","authors":"A. Ziogas, Tal Ben-Nun, Timo Schneider, T. Hoefler","doi":"10.1145/3447818.3460360","DOIUrl":"https://doi.org/10.1145/3447818.3460360","url":null,"abstract":"Python, already one of the most popular languages for scientific computing, has made significant inroads in High Performance Computing (HPC). At the center of Python's ecosystem is NumPy, an efficient implementation of the multi-dimensional array (tensor) structure, together with basic arithmetic and linear algebra. Compared to traditional HPC languages, the relatively low performance of Python and NumPy has spawned significant research in compilers and frameworks that decouple Python's compact representation from the underlying implementation. However, it is challenging to compare language compatibility and performance among different frameworks and architectures without a standard set of benchmarks and metrics. To that end, we introduce NPBench, a set of NumPy code samples representing a large variety of HPC applications. We use NPBench to test popular NumPy-accelerating compilers and frameworks on a variety of metrics. NPBench will guide both end-users and framework developers focusing on performance and will drive further use of Python in the high-performance scientific domains.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"25 4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77389894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
Accelerating DNNs inference with predictive layer fusion 用预测层融合加速dnn推理
MohammadHossein Olyaiy, Christopher Ng, Mieszko Lis
Many modern convolutional neural neworks (CNNs) rely on bottleneck block structures where the activation tensor is mapped between higher dimensions using an intermediate low dimension, and convolved with depthwise feature filters rather than multi-channel filters. Because most of the computation lies in computing the large dimensional tensors, however, such networks cannot be scaled without significant computation costs. In this paper, we show how emph{fusing} the layers inside these blocks can dramatically reduce the multiplication count (by 6--20x) at the cost of extra additions. ReLU nonlinearities are predicted dynamically, and only the activations that survive ReLU contribute to directly compute the output of the block. We also propose FusioNet, a CNN architecture optimized for fusion, as well as ARCHON, a novel accelerator design with a dataflow optimized for fused networks. When FusioNet is executed on the proposed accelerator, it yields up to 5.8x faster inference compared to compact networks executed on a dense DNN accelerator, and 2.1x faster inference compared to the same networks when pruned and executed on a sparse DNN accelerator.
许多现代卷积神经网络(cnn)依赖于瓶颈块结构,其中激活张量使用中间低维在高维之间进行映射,并使用深度特征滤波器而不是多通道滤波器进行卷积。然而,由于大部分的计算是在计算大维张量,这样的网络无法在没有显著计算成本的情况下进行扩展。在本文中,我们展示了如何emph{融合}这些块内的层可以以额外的添加为代价显着减少乘法计数(减少6- 20倍)。ReLU非线性是动态预测的,只有在ReLU中幸存的激活才能直接计算区块的输出。我们还提出了一种针对融合优化的CNN架构FusioNet,以及一种针对融合网络优化的数据流的新型加速器ARCHON。当在提议的加速器上执行FusioNet时,与在密集DNN加速器上执行的紧凑网络相比,它产生的推理速度快5.8倍,与在稀疏DNN加速器上执行的修剪和执行的相同网络相比,它产生的推理速度快2.1倍。
{"title":"Accelerating DNNs inference with predictive layer fusion","authors":"MohammadHossein Olyaiy, Christopher Ng, Mieszko Lis","doi":"10.1145/3447818.3460378","DOIUrl":"https://doi.org/10.1145/3447818.3460378","url":null,"abstract":"Many modern convolutional neural neworks (CNNs) rely on bottleneck block structures where the activation tensor is mapped between higher dimensions using an intermediate low dimension, and convolved with depthwise feature filters rather than multi-channel filters. Because most of the computation lies in computing the large dimensional tensors, however, such networks cannot be scaled without significant computation costs. In this paper, we show how emph{fusing} the layers inside these blocks can dramatically reduce the multiplication count (by 6--20x) at the cost of extra additions. ReLU nonlinearities are predicted dynamically, and only the activations that survive ReLU contribute to directly compute the output of the block. We also propose FusioNet, a CNN architecture optimized for fusion, as well as ARCHON, a novel accelerator design with a dataflow optimized for fused networks. When FusioNet is executed on the proposed accelerator, it yields up to 5.8x faster inference compared to compact networks executed on a dense DNN accelerator, and 2.1x faster inference compared to the same networks when pruned and executed on a sparse DNN accelerator.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75584350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication 稀疏倍高瘦密矩阵乘法的分布式内存并行算法
Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, K. Yelick, A. Buluç
Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes by focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affect computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.
稀疏次密矩阵乘法(SpMM)在计算线性代数等成熟领域以及图神经网络等新兴领域都有应用。在本研究中,我们通过关注GPU加速器,评估了将SpMM作为跨多个节点的分布式计算执行的各种技术的性能。我们将研究最先进的SpMM实现的实际本地计算性能如何影响计算效率,因为当我们扩展到大量节点时,维度发生变化,这被证明是一个意想不到的重要瓶颈。我们还考虑了各种分布策略,包括A-Stationary, B-Stationary和C-Stationary算法,1.5D和2D算法,以及基于rdma和批量同步的数据传输方法。我们的研究结果表明,算法和实现技术的最佳选择不仅取决于特定矩阵大小和维度的通信成本,还取决于局部SpMM操作的性能。我们的评估表明,在GPU加速器的参与下,SpMM的最佳设计选择不同于已知在密集矩阵-矩阵或稀疏矩阵-稀疏矩阵乘法中表现良好的传统算法。
{"title":"Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication","authors":"Oguz Selvitopi, Benjamin Brock, Israt Nisa, Alok Tripathy, K. Yelick, A. Buluç","doi":"10.1145/3447818.3461472","DOIUrl":"https://doi.org/10.1145/3447818.3461472","url":null,"abstract":"Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes by focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affect computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87780382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
期刊
ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1