
Latest publications from ICS ... : proceedings of the ... ACM International Conference on Supercomputing

MD-HM: memoization-based molecular dynamics simulations on big memory system
Zhen Xie, Wenqian Dong, Jie Liu, I. Peng, Yanbao Ma, Dong Li
Molecular dynamics (MD) simulation is a fundamental method for modeling ensembles of particles. In this paper, we introduce a new method to improve the performance of MD by leveraging the emerging TB-scale big memory system. In particular, we trade memory capacity for computation capability to improve MD performance via a lookup table-based memoization technique. The traditional memoization technique for MD simulation uses relatively small DRAM, is based on a suboptimal data structure, and replaces pair-wise computation, which leads to limited performance benefit on a big memory system. We introduce MD-HM, a memoization-based MD simulation framework customized for the big memory system. MD-HM partitions the simulation field into subgrids and replaces computation in each subgrid as a whole, based on a lightweight pattern-match algorithm that recognizes computation in the subgrid. MD-HM uses a new two-phase LSM-tree to optimize read/write performance. Evaluated with nine MD simulations on an Intel Optane-based big memory system, MD-HM outperforms the state-of-the-art LAMMPS simulation framework with an average speedup of 7.6x.
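The capacity-for-compute trade at the heart of the paper is easy to see in miniature. Below is a minimal sketch of lookup-table memoization for a pair potential (Lennard-Jones here); it is illustrative only, since MD-HM memoizes whole subgrids via pattern matching rather than single pair interactions, and every name and constant is invented for the example.

```python
import numpy as np

# Minimal sketch of lookup-table memoization for a Lennard-Jones pair
# potential: precompute U(r) on a dense grid once, then answer queries
# with a table lookup instead of recomputing the expensive formula.
# Illustrative only -- MD-HM memoizes whole subgrids, not single pairs.

R_MIN, R_CUT, N_BINS = 0.5, 2.5, 1_000_000

def lj(r, eps=1.0, sigma=1.0):
    """Exact Lennard-Jones potential, the 'expensive' computation."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)

# Build the table once: memory capacity traded for computation.
_r = np.linspace(R_MIN, R_CUT, N_BINS)
_table = lj(_r)

def lj_memoized(r):
    # Nearest-bin truncation; linear interpolation would cut the error further.
    idx = ((r - R_MIN) / (R_CUT - R_MIN) * (N_BINS - 1)).astype(int)
    return _table[np.clip(idx, 0, N_BINS - 1)]

rs = np.random.uniform(0.8, 2.4, 1_000_000)
assert np.allclose(lj_memoized(rs), lj(rs), atol=1e-2)
```

The table costs a few megabytes here; each potential evaluation becomes an index computation plus a load, which is the same trade MD-HM exploits at TB scale.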
Citations: 17
On the automatic parallelization of subscripted subscript patterns using array property analysis
Akshay Bhosale, R. Eigenmann
Parallelizing loops with subscripted subscript patterns at compile time has long been a challenge for automatic parallelizers. In the class of irregular applications that we have analyzed, the presence of subscripted subscript patterns was one of the primary reasons why a significant number of loops could not be automatically parallelized. Loops with such patterns can be parallelized if the subscript array, or the expression in which the subscript array appears, possesses certain properties, such as monotonicity. The information required to prove the existence of these properties is often present in the application code itself, which suggests that their automatic detection may be feasible. In this paper, we present an algebra for representing and reasoning about subscript array properties, and we discuss a compile-time algorithm, based on symbolic range aggregation, that can prove monotonicity and parallelize key loops. We show that this algorithm can produce significant performance gains, not only in the parallelized loops but also in the overall applications.
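To see why monotonicity matters, consider the following hedged sketch (a runtime check, not the paper's compile-time symbolic-range algorithm): if the subscript array is strictly monotonic, every write lands on a distinct element, so the loop carries no cross-iteration dependence.

```python
import numpy as np

# Sketch of the key insight: a strictly monotonic subscript array means
# the writes a[idx[i]] touch pairwise-distinct locations, so the loop has
# no output dependence and its iterations can run in parallel.

def is_strictly_monotonic(idx):
    d = np.diff(idx)
    return bool(np.all(d > 0) or np.all(d < 0))

idx = np.array([0, 2, 3, 7, 9])   # e.g., CSR row pointers are monotonic
a = np.zeros(10)
if is_strictly_monotonic(idx):
    # Safe to distribute iterations across threads/workers:
    a[idx] += 1.0                  # each target element is written exactly once
```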
Citations: 4
Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation
K. Abdelaal, Martin Kong
Loop tiling is a key high-level transformation known to maximize locality in loop-intensive programs. It has been successfully applied to a number of applications including tensor contractions, iterative stencils, and machine learning, and the technique has been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation largely depends on the given tile sizes, due to the complex trade-off between locality and parallelism. This problem is exacerbated in GPGPU architectures due to limited hardware resources such as the available shared memory. In this paper we present a new technique to compute resource-conscious tile sizes for affine programs. We use Integer Linear Programming (ILP) constraints and objectives in a cross-compiler fashion to faithfully and effectively mimic the transformations applied in a polyhedral GPU compiler (PPCG). Our approach significantly reduces the need for experimental auto-tuning by generating only two tile size configurations that achieve strong out-of-the-box performance. We evaluate the effectiveness of our technique using the Polybench benchmark suite on two GPGPUs, an AMD Radeon VII and an NVIDIA Tesla V100, using the OpenCL and CUDA programming models. Experimental validation reveals that our approach achieves nearly 75% of the performance of the best empirically found tile configuration across both architectures.
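The capacity constraint at the heart of the problem can be sketched as a small search. The sketch below is illustrative and not PPCG's actual ILP formulation; the candidate sizes, the 48 KiB shared-memory budget, and the FP32 matmul footprint model are assumptions made for the example.

```python
# Illustrative sketch of the capacity constraint behind tile-size
# selection: pick tile sizes (Ti, Tj, Tk) whose per-tile shared-memory
# footprint for the A, B and C tiles of an FP32 matmul fits the budget,
# maximizing work per tile as a proxy for data reuse.

SHARED_BYTES = 48 * 1024
ELEM = 4  # bytes per float32

def footprint(ti, tj, tk):
    # A tile: ti*tk, B tile: tk*tj, C tile: ti*tj elements
    return ELEM * (ti * tk + tk * tj + ti * tj)

best = max(
    ((ti, tj, tk)
     for ti in (8, 16, 32, 64, 128)
     for tj in (8, 16, 32, 64, 128)
     for tk in (8, 16, 32)
     if footprint(ti, tj, tk) <= SHARED_BYTES),
    key=lambda t: t[0] * t[1] * t[2],   # work per tile
)
print("tile sizes:", best, "footprint:", footprint(*best), "bytes")
```

A real polyhedral compiler folds this feasibility check into an ILP together with legality and parallelism objectives; the point here is only the locality-versus-capacity trade-off.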
Citations: 4
Proxima
Yuliana Zamora, Logan T. Ward, G. Sivaraman, I. Foster, H. Hoffmann
Atomistic-scale simulations are prominent scientific applications that require the repetitive execution of a computationally expensive routine to calculate a system's potential energy. Prior work shows that these expensive routines can be replaced with a machine-learned surrogate approximation to accelerate the simulation at the expense of the overall accuracy. The exact balance of speed and accuracy depends on the specific configuration of the surrogate-modeling workflow and the science itself, and prior work leaves it up to the scientist to find a configuration that delivers the required accuracy for their science problem. Unfortunately, due to the underlying system dynamics, it is rare that a single surrogate configuration presents an optimal accuracy/latency trade-off for the entire simulation. In practice, scientists must choose conservative configurations so that accuracy is always acceptable, forgoing possible acceleration. As an alternative, we propose Proxima, a systematic and automated method for dynamically tuning a surrogate-modeling configuration in response to real-time feedback from the ongoing simulation. Proxima estimates the uncertainty of applying a surrogate approximation in each step of an iterative simulation. Using this information, the specific surrogate configuration can be adjusted dynamically to ensure maximum speedup while sustaining a required accuracy metric. We evaluate Proxima using a Monte Carlo sampling application and find that Proxima respects a wide range of user-defined accuracy goals while achieving speedups of 1.02--5.5X relative to a standard
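The control loop Proxima automates can be pictured as an uncertainty gate around the surrogate. The snippet below is a toy illustration: exact_energy, EnsembleSurrogate, and THRESHOLD are all invented names, this is not Proxima's API, and the ensemble-spread uncertainty estimate merely stands in for whatever estimator the system actually uses.

```python
import numpy as np

# Hedged sketch of uncertainty-gated surrogate use: take the cheap
# surrogate prediction when its estimated uncertainty is low, otherwise
# fall back to the exact (expensive) routine.

def exact_energy(x):
    return np.sum(x ** 2)               # stand-in for the expensive routine

class EnsembleSurrogate:
    """Toy ensemble whose spread serves as an uncertainty estimate."""
    def __init__(self, n=5):
        # Each member gets a slightly different weight, fixed at creation.
        self.models = [lambda x, w=np.random.uniform(0.9, 1.1): w * np.sum(x ** 2)
                       for _ in range(n)]
    def predict(self, x):
        preds = np.array([m(x) for m in self.models])
        return preds.mean(), preds.std()

surrogate, THRESHOLD = EnsembleSurrogate(), 0.05
for _ in range(10):
    x = np.random.randn(3)
    mean, std = surrogate.predict(x)
    if std / max(abs(mean), 1e-12) < THRESHOLD:   # relative uncertainty gate
        energy = mean                              # cheap path
    else:
        energy = exact_energy(x)                   # trusted path
```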
{"title":"Proxima","authors":"Yuliana Zamora, Logan T. Ward, G. Sivaraman, I. Foster, H. Hoffmann","doi":"10.1145/3447818.3460370","DOIUrl":"https://doi.org/10.1145/3447818.3460370","url":null,"abstract":"Atomistic-scale simulations are prominent scientific applications that require the repetitive execution of a computationally expensive routine to calculate a system's potential energy. Prior work shows that these expensive routines can be replaced with a machine-learned surrogate approximation to accelerate the simulation at the expense of the overall accuracy. The exact balance of speed and accuracy depends on the specific configuration of the surrogate-modeling workflow and the science itself, and prior work leaves it up to the scientist to find a configuration that delivers the required accuracy for their science problem. Unfortunately, due to the underlying system dynamics, it is rare that a single surrogate configuration presents an optimal accuracy/latency trade-off for the entire simulation. In practice, scientists must choose conservative configurations so that accuracy is always acceptable, forgoing possible acceleration. As an alternative, we propose Proxima, a systematic and automated method for dynamically tuning a surrogate-modeling configuration in response to real-time feedback from the ongoing simulation. Proxima estimates the uncertainty of applying a surrogate approximation in each step of an iterative simulation. Using this information, the specific surrogate configuration can be adjusted dynamically to ensure maximum speedup while sustaining a required accuracy metric. We evaluate Proxima using a Monte Carlo sampling application and find that Proxima respects a wide range of user-defined accuracy goals while achieving speedups of 1.02--5.5X relative to a standard","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80731926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
NumaPerf
Xin Zhao, Jin Zhou, Hui Guan, Wei Wang, Xu Liu, Tongping Liu
It is extremely challenging to achieve optimal performance of parallel applications on a NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. NumaPerf further detects potential thread migration and load imbalance issues that could significantly affect performance but are omitted by existing profilers. NumaPerf also separately identifies cache coherence issues that may require different fix strategies. Based on our extensive evaluation, NumaPerf can identify more performance issues than any existing tool, while fixing them leads to significant performance speedups.
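The hardware-independent flavor of the analysis can be illustrated with a toy trace: group memory writes by cache line and flag lines touched by multiple threads. This is a hedged sketch of the general idea, not NumaPerf's implementation; the trace and constants are invented.

```python
from collections import defaultdict

# Sketch: given a trace of (thread, address) writes, group addresses by
# 64-byte cache line and flag lines written by more than one thread --
# a sharing pattern that exists regardless of which NUMA node the page
# currently lives on, hence portable across NUMA architectures.

LINE = 64
trace = [(0, 0x1000), (1, 0x1008),   # same line, two writers -> shared
         (0, 0x2000), (0, 0x2008)]   # same line, one writer  -> private

writers = defaultdict(set)
for tid, addr in trace:
    writers[addr // LINE].add(tid)

for line, tids in writers.items():
    if len(tids) > 1:
        print(f"cache line {line:#x}: written by threads {sorted(tids)} "
              f"(potential false/true sharing)")
```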
{"title":"NumaPerf","authors":"Xin Zhao, Jin Zhou, Hui Guan, Wei Wang, Xu Liu, Tongping Liu","doi":"10.1145/3447818.3460361","DOIUrl":"https://doi.org/10.1145/3447818.3460361","url":null,"abstract":"It is extremely challenging to achieve optimal performance of parallel applications on a NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues. This paper proposes a novel profiling tool–NumaPerf–that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only on the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues that could significantly affect the performance but are omitted by existing profilers. NumaPerf also identifies cache coherence issues separately that may require different fix strategies. Based on our extensive evaluation, NumaPerf can identify more performance issues than any existing tool, while fixing them leads to significant performance speedup.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81273583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
ProMT
Mazen Alwadi, Aziz Mohaisen, Amr Awad
Current computer systems are vulnerable to a wide range of attacks, caused by the proliferation of accelerators and the fact that current systems comprise multiple SoCs from different vendors. Thus, major processor vendors are moving towards limiting the trust boundary to the processor chip only, as in Intel's SGX, AMD's SME, and ARM's TrustZone. This secure boundary limitation requires protecting the memory content against data remanence attacks, which were performed against DRAM in the form of cold-boot attacks and are more successful against NVM due to NVM's data persistency feature. However, implementing secure memory features such as memory encryption and integrity verification has a non-trivial performance overhead and can significantly reduce the emerging NVM's expected lifetime. Previous work looked at reducing the overheads of secure memory implementations by packing more counters into a cache line, increasing the cacheability of security metadata, slightly reducing the size of the integrity tree, or using the ECC chip to store the MAC values. However, the root update process, which requires a sequential update of the MAC values at all integrity tree levels, is barely studied. In this paper, we propose ProMT, a novel memory controller design that ensures a persistently secure system with minimal overheads. ProMT protects data confidentiality and ensures data integrity with minimal overheads. ProMT reduces the performance overhead of secure memory implementation to 11.7%, extends the NVM's lifetime by 3.59x, and enables system recovery in a fraction of a second.
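The sequential nature of the root update is visible even in a four-leaf toy tree: changing one block forces the MAC at each level on its path to be recomputed in order, because every level's input includes the level below. The sketch uses keyed BLAKE2 as a stand-in MAC and is not ProMT's actual scheme.

```python
import hashlib

# Minimal sketch of why an integrity-tree root update is sequential:
# each level's MAC depends on the level below, so a leaf update walks
# the path to the root one level at a time.

def mac(data: bytes, key: bytes = b"k") -> bytes:
    return hashlib.blake2b(data, key=key, digest_size=8).digest()

# Four leaf blocks, a binary tree of MACs above them.
leaves = [b"block%d" % i for i in range(4)]
level1 = [mac(leaves[0] + leaves[1]), mac(leaves[2] + leaves[3])]
root = mac(level1[0] + level1[1])

# Update leaf 2: level1[1] and then the root must be recomputed, in order.
leaves[2] = b"block2'"
level1[1] = mac(leaves[2] + leaves[3])
root = mac(level1[0] + level1[1])
```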
{"title":"ProMT","authors":"Mazen Alwadi, Aziz Mohaisen, Amr Awad","doi":"10.1145/3447818.3460377","DOIUrl":"https://doi.org/10.1145/3447818.3460377","url":null,"abstract":"Current computer systems are vulnerable to a wide range of attacks caused by the proliferation of accelerators, and the fact that current system comprise multiple SoCs provided from different vendors. Thus, major processor vendors are moving towards limiting the trust boundary to the processor chip only as in Intel's SGX, AMD's SME, and ARM's TrustZone. This secure boundary limitation requires protecting the memory content against data remanence attacks, which were performed against DRAM in the form of cold-boot attack and are more successful against NVM due to NVM's data persistency feature. However, implementing secure memory features, such as memory encryption and integrity verification has a non-trivial performance overhead, and can significantly reduce the emerging NVM's expected lifetime. Previous work looked at reducing the overheads of the secure memory implementation by packing more counters into a cache line, increasing the cacheability of security metadata, slightly reducing the size of the integrity tree, or using the ECC chip to store the MAC values. However, the root update process is barely studied, which requires a sequential update of the MAC values in all the integrity tree levels. In this paper, we propose ProMT, a novel memory controller design that ensures a persistently secure system with minimal overheads. ProMT protects the data confidentiality and ensures the data integrity with minimal overheads. ProMT reduces the performance overhead of secure memory implementation to 11.7%, extends the NVM's life time by 3.59x, and enables the system recovery in a fraction of a second.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80292786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
HyQuas
Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, Jidong Zhai
Quantum computing has shown its strong potential in solving certain important problems. Due to the intrinsic limitations of current real quantum computers, quantum circuit simulation still plays an important role in both the research and development of quantum computing. GPU-based quantum circuit simulation has been explored due to GPUs' high computation capability. Despite previous efforts, existing quantum circuit simulation systems usually rely on a single method to improve the poor data locality caused by complex quantum entanglement. However, we observe that existing simulation methods show significantly different performance for different circuit patterns; optimal performance cannot be obtained with any single method. To address these challenges, we propose HyQuas, a hybrid-partitioner-based quantum circuit simulation system on GPU, which can automatically select the suitable simulation method for different parts of a given quantum circuit according to its pattern. Moreover, to better support HyQuas, we also propose two highly optimized methods, OShareMem and TransMM, as optional choices in HyQuas. We further propose a GPU-centric communication pipelining approach for efficient distributed simulation. Experimental results show that HyQuas can achieve up to 10.71x speedup on a single GPU and 227x speedup on a GPU cluster over state-of-the-art quantum circuit simulation systems.
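For background, the kernel every state-vector simulator must execute in some form applies a 2x2 gate matrix to amplitude pairs whose indices differ only in the target-qubit bit. The stride between paired amplitudes changes with the target qubit, which is exactly why different methods suit different circuit patterns. A minimal, unoptimized reference version (not HyQuas's GPU implementation):

```python
import numpy as np

# Apply a single-qubit gate U to qubit t of an n-qubit state vector.
# Amplitude pairs differ only in bit t, so they sit 2**t apart.

def apply_gate(state, U, t):
    n = state.shape[0]
    stride = 1 << t
    for base in range(0, n, stride << 1):
        for off in range(base, base + stride):
            a0, a1 = state[off], state[off + stride]
            state[off] = U[0, 0] * a0 + U[0, 1] * a1
            state[off + stride] = U[1, 0] * a0 + U[1, 1] * a1

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
state = np.zeros(2 ** 3, dtype=complex)
state[0] = 1.0
apply_gate(state, H, 0)   # |000> -> (|000> + |001>)/sqrt(2)
```

For small t the pairs are contiguous and cache-friendly; for large t they are far apart, which is the locality problem the hybrid partitioner targets.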
{"title":"HyQuas","authors":"Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, Jidong Zhai","doi":"10.1145/3447818.3460357","DOIUrl":"https://doi.org/10.1145/3447818.3460357","url":null,"abstract":"Quantum computing has shown its strong potential in solving certain important problems. Due to the intrinsic limitations of current real quantum computers, quantum circuit simulation still plays an important role in both research and development of quantum computing. GPU-based quantum circuit simulation has been explored due to GPU's high computation capability. Despite previous efforts, existing quantum circuit simulation systems usually rely on a single method to improve poor data locality caused by complex quantum entanglement. However, we observe that existing simulation methods show significantly different performance for different circuit patterns. The optimal performance cannot be obtained only with any single method. To address these challenges, we propose HyQuas, a textbf{Hy}brid partitioner based textbf{Qua}ntum circuit textbf{S}imulation system on GPU, which can automatically select the suitable simulation method for different parts of a given quantum circuit according to its pattern. Moreover, to make better support for HyQuas, we also propose two highly optimized methods, OShareMem and TransMM, as optional choices of HyQuas. We further propose a GPU-centric communication pipelining approach for effective distributed simulation. Experimental results show that HyQuas can achieve up to 10.71 x speedup on a single GPU and 227 x speedup on a GPU cluster over state-of-the-art quantum circuit simulation systems.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75151567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Delay sensitivity-driven congestion mitigation for HPC systems
Archit Patke, Saurabh Jha, Haoran Qiu, J. Brandt, A. Gentile, Joe Greenseid, Z. Kalbarczyk, R. Iyer
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. The delay sensitivity of an application quantifies the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism that selectively throttles applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving median system utility by 12%.
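A hedged sketch of how a delay-sensitivity metric might be estimated (the paper's exact definition may differ): regress observed runtime against measured congestion delay and read the sensitivity off the slope. The numbers below are invented.

```python
import numpy as np

# Toy delay-sensitivity estimate: fit runtime as a linear function of
# per-message congestion delay; the normalized slope says how much each
# unit of added delay stretches the runtime. Low-sensitivity applications
# are candidates for selective throttling.

congestion_delay = np.array([0.0, 1.0, 2.0, 4.0, 8.0])   # ms per message
runtime = np.array([10.0, 10.4, 10.9, 11.8, 13.7])       # seconds observed

slope, intercept = np.polyfit(congestion_delay, runtime, 1)
sensitivity = slope / intercept   # relative runtime stretch per ms of delay
print(f"delay sensitivity ~ {sensitivity:.3f} per ms")
```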
Citations: 1
PLANAR: a programmable accelerator for near-memory data rearrangement
Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó
Many applications employ irregular and sparse memory accesses that cannot take advantage of existing cache hierarchies in high-performance processors. To solve this problem, Data Layout Transformation (DLT) techniques rearrange sparse data into a dense representation, improving locality and cache utilization. However, prior proposals in this space fail to provide a design that (i) scales with multi-core systems, (ii) hides rearrangement latency, and (iii) provides the necessary interfaces to ease programmability. In this work we present PLANAR, a programmable near-memory accelerator that rearranges sparse data into dense form. By placing PLANAR devices at the memory controller level we enable a design that scales well with multi-core systems, hides operation latency by performing non-blocking fine-grain data rearrangements, and eases programmability by supporting virtual memory and conventional memory allocation mechanisms. Our evaluation shows that PLANAR leads to significant reductions in data movement and dynamic energy, providing an average 4.58× speedup.
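The rearrangement itself is a gather. The sketch below shows the software analogue of what PLANAR performs near memory, with invented sizes; the point is that the compute kernel afterwards streams through unit-stride, cache-friendly data.

```python
import numpy as np

# Software analogue of a data-layout transformation: gather scattered
# elements into a dense buffer once, so the kernel then streams through
# contiguous memory instead of chasing an irregular access pattern.

data = np.random.rand(1 << 20)
idx = np.random.randint(0, data.size, 4096)   # irregular access pattern

# Without DLT: each access data[idx[i]] likely misses in cache.
# With DLT: one gather produces a dense array the kernel can stream.
dense = data[idx]                  # the "rearrangement" step
result = np.sum(dense * 2.0)       # kernel now sees unit-stride data
```

PLANAR's contribution is doing this gather near memory, asynchronously and at fine grain, rather than on the CPU as shown here.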
Citations: 2
DSGEN
Xiaofan Sun, Rajiv Gupta
Concolic testing combines concrete execution with symbolic execution along the executed path to automatically generate new test inputs that exercise program paths and deliver high code coverage during testing. The GKLEE tool uses this approach to expose data races in CUDA programs written for execution on GPGPUs. In programs employing concurrent dynamic data structures, automatically generating data structures with appropriate shapes that cause threads to follow selected, possibly divergent, paths is a challenge. Moreover, a single non-conflicting data structure must be generated for multiple threads, that is, a single shape must be found that simultaneously causes all threads to follow their respective chosen paths. When an execution exposes a bug (e.g., a data race), the generated data structure shape helps the programmer understand the cause of the bug. Because GKLEE does not permit pointers that construct dynamic data structures to be made symbolic, it cannot automatically generate data structures of different shapes and must rely on the user to write code that constructs them to exercise desired paths. We have developed DSGEN for automatically generating non-conflicting dynamic data structures with different shapes and integrated it with GKLEE to uncover, and facilitate understanding of, data races in programs that employ complex concurrent dynamic data structures. In comparison to GKLEE, DSGEN increases the number of races detected from 10 to 25 by automatically generating a total of 1,897 shapes in implementations of four complex concurrent dynamic data structures -- B-Tree, Hash-Array Mapped Trie, RRB-Tree, and Skip List.
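A much-simplified sketch of the shape-generation goal (DSGEN's actual mechanism is concolic and far more involved): among candidate tree shapes, find one that simultaneously sends each thread's lookup down its chosen path. Shapes, keys, and the path encoding are all invented for illustration.

```python
# Toy stand-in for generating a single non-conflicting structure that
# satisfies every thread's chosen path at once: enumerate small binary
# search tree shapes and keep one where thread 0 (searching key 1) goes
# left at the root while thread 1 (searching key 3) goes right.

def path(shape, key):
    """shape: dict node -> (left, right, value); returns L/R path to key."""
    node, trace = "root", ""
    while node is not None:
        left, right, value = shape[node]
        if key == value:
            return trace
        node, trace = (left, trace + "L") if key < value else (right, trace + "R")
    return trace

shapes = [
    {"root": ("a", "b", 2), "a": (None, None, 1), "b": (None, None, 3)},
    {"root": (None, "a", 1), "a": (None, "b", 2), "b": (None, None, 3)},
]
for s in shapes:
    if path(s, 1).startswith("L") and path(s, 3).startswith("R"):
        print("non-conflicting shape found:", s)
```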
{"title":"DSGEN","authors":"Xiaofan Sun, Rajiv Gupta","doi":"10.1145/3447818.3460962","DOIUrl":"https://doi.org/10.1145/3447818.3460962","url":null,"abstract":"Concolic testing combines concrete execution with symbolic execution along the executed path to automatically generate new test inputs that exercise program paths and deliver high code coverage during testing. The GKLEE tool uses this approach to expose data races in CUDA programs written for execution of GPGPUs. In programs employing concurrent dynamic data structures, automatic generation of data structures with appropriate shapes that cause threads to follow selected, possibly divergent, paths is a challenge. Moreover, a single non-conflicting data structure must be generated for multiple threads, that is, a single shape must be found that simultaneously causes all threads to follow their respective chosen paths. When an execution exposes a bug (e.g., a data race), the generated data structure shape helps the programmer understand the cause of the bug. Because GKLEE does not permit pointers that construct dynamic data structures to be made symbolic, it cannot automatically generate data structures of different shapes and must rely on the user to write code that constructs them to exercise desired paths. We have developed DSGEN for automatically generating non-conflicting dynamic data structures with different shapes and integrated it with GKLEE to uncover and facilitate understanding of data races in programs that employ complex concurrent dynamic data structures. In comparison to GKLEE, DSGEN increases the number of races detected from 10 to 25 by automatically generating a total of 1,897 shapes in implementations of four complex concurrent dynamic data structures -- B-Tree, Hash-Array Mapped Trie, RRB-Tree, and Skip List.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"120 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75884086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1