Zhen Xie, Wenqian Dong, Jie Liu, I. Peng, Yanbao Ma, Dong Li
Molecular dynamics (MD) simulation is a fundamental method for modeling ensembles of particles. In this paper, we introduce a new method to improve the performance of MD by leveraging emerging TB-scale big memory systems. In particular, we trade memory capacity for computation capability to improve MD performance through lookup table-based memoization. The traditional memoization technique for MD simulation uses relatively small DRAM, is based on a suboptimal data structure, and replaces only pair-wise computation, which leads to limited performance benefit on a big memory system. We introduce MD-HM, a memoization-based MD simulation framework customized for big memory systems. MD-HM partitions the simulation field into subgrids and replaces computation in each subgrid as a whole, based on a lightweight pattern-matching algorithm that recognizes the computation in the subgrid. MD-HM uses a new two-phase LSM-tree to optimize read/write performance. Evaluating nine MD simulations, we show that MD-HM outperforms the state-of-the-art LAMMPS simulation framework with an average speedup of 7.6x on an Intel Optane-based big memory system.
{"title":"MD-HM: memoization-based molecular dynamics simulations on big memory system","authors":"Zhen Xie, Wenqian Dong, Jie Liu, I. Peng, Yanbao Ma, Dong Li","doi":"10.1145/3447818.3460365","DOIUrl":"https://doi.org/10.1145/3447818.3460365","url":null,"abstract":"Molecular dynamics (MD) simulation is a fundamental method for modeling ensembles of particles. In this paper, we introduce a new method to improve the performance of MD by leveraging the emerging TB-scale big memory system. In particular, we trade memory capacity for computation capability to improve MD performance by the lookup table-based memoization technique. The traditional memoization technique for the MD simulation uses relatively small DRAM, bases on a suboptimal data structure, and replaces pair-wise computation, which leads to limited performance benefit in the big memory system. We introduce MD-HM, a memoization-based MD simulation framework customized for the big memory system. MD-HM partitions the simulation field into subgrids, and replaces computation in each subgrid as a whole based on a lightweight pattern-match algorithm to recognize computation in the subgrid. MD-HM uses a new two-phase LSM-tree to optimize read/write performance. Evaluating with nine MD simulations, we show that MD-HM outperforms the state-of-the-art LAMMPS simulation framework with an average speedup of 7.6x based on the Intel Optane-based big memory system.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72780823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallelizing loops with subscripted subscript patterns at compile time has long been a challenge for automatic parallelizers. In the class of irregular applications that we have analyzed, the presence of subscripted subscript patterns was one of the primary reasons why a significant number of loops could not be automatically parallelized. Loops with such patterns can be parallelized if the subscript array, or the expression in which the subscript array appears, possesses certain properties, such as monotonicity. The information required to prove the existence of these properties is often present in the application code itself, which suggests that their automatic detection may be feasible. In this paper, we present an algebra for representing and reasoning about subscript array properties, and we discuss a compile-time algorithm, based on symbolic range aggregation, that can prove monotonicity and parallelize key loops. We show that this algorithm can produce significant performance gains, not only in the parallelized loops but also in the overall applications.
{"title":"On the automatic parallelization of subscripted subscript patterns using array property analysis","authors":"Akshay Bhosale, R. Eigenmann","doi":"10.1145/3447818.3460424","DOIUrl":"https://doi.org/10.1145/3447818.3460424","url":null,"abstract":"Parallelizing loops with subscripted subscript patterns at compile-time has long been a challenge for automatic parallelizers. In the class of irregular applications that we have analyzed, the presence of subscripted subscript patterns was one of the primary reasons why a significant number of loops could not be automatically parallelized. Loops with such patterns can be parallelized, if the subscript array or the expression in which the subscript array appears possess certain properties, such as monotonicity. The information required to prove the existence of these properties is often present in the application code itself. This suggests that their automatic detection may be feasible. In this paper, we present an algebra for representing and reasoning about subscript array properties, and we discuss a compile-time algorithm, based on symbolic range aggregation, that can prove monotonicity and parallelize key loops. We show that this algorithm can produce significant performance gains, not only in the parallelized loops, but also in the overall applications.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82678889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Loop tiling is a key high-level transformation known to maximize locality in loop-intensive programs. It has been successfully applied to a number of applications, including tensor contractions, iterative stencils, and machine learning, and has been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation largely depends on a given set of inputs, the tile sizes, due to the complex trade-off between locality and parallelism. This problem is exacerbated on GPGPU architectures due to limited hardware resources such as the available shared memory. In this paper we present a new technique to compute resource-conscious tile sizes for affine programs. We use Integer Linear Programming (ILP) constraints and objectives in a cross-compiler fashion to faithfully and effectively mimic the transformations applied by a polyhedral GPU compiler (PPCG). Our approach significantly reduces the need for experimental auto-tuning by generating only two tile-size configurations that achieve strong out-of-the-box performance. We evaluate the effectiveness of our technique using the Polybench benchmark suite on two GPGPUs, an AMD Radeon VII and an NVIDIA Tesla V100, using the OpenCL and CUDA programming models. Experimental validation reveals that our approach achieves nearly 75% of the performance of the best empirically found tile configuration across both architectures.
{"title":"Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation","authors":"K. Abdelaal, Martin Kong","doi":"10.1145/3447818.3460369","DOIUrl":"https://doi.org/10.1145/3447818.3460369","url":null,"abstract":"Loop tiling is a key high-level transformation which is known to maximize locality in loop intensive programs. It has been successfully applied to a number of applications including tensor contractions, iterative stencils and machine learning. This technique has also been extended to a wide variety of computational domains and architectures. The performance achieved with this critical transformation largely depends on a set of inputs given, the tile sizes, due to the complex trade-off between locality and parallelism. This problem is exacerbated in GPGPU architectures due to limited hardware resources such as the available shared-memory. In this paper we present a new technique to compute resource conscious tile sizes for affine programs. We use Integer Linear Programming (ILP) constraints and objectives in a cross-compiler fashion to faithfully and effectively mimic the transformations applied in a polyhedral GPU compiler (PPCG). Our approach significantly reduces the need for experimental auto-tuning by generating only two tile size configurations that achieve strong out-of-the-box performance. We evaluate the effectiveness of our technique using the Polybench benchmark suite on two GPGPUs, an AMD Radeon VII and an NVIDIA Tesla V100, using OpenCL and CUDA programming models. Experimental validation reveals that our approach achieves nearly 75% of the best empirically found tile configuration across both architectures.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88397871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuliana Zamora, Logan T. Ward, G. Sivaraman, I. Foster, H. Hoffmann
Atomistic-scale simulations are prominent scientific applications that require the repetitive execution of a computationally expensive routine to calculate a system's potential energy. Prior work shows that these expensive routines can be replaced with a machine-learned surrogate approximation to accelerate the simulation at the expense of the overall accuracy. The exact balance of speed and accuracy depends on the specific configuration of the surrogate-modeling workflow and the science itself, and prior work leaves it up to the scientist to find a configuration that delivers the required accuracy for their science problem. Unfortunately, due to the underlying system dynamics, it is rare that a single surrogate configuration presents an optimal accuracy/latency trade-off for the entire simulation. In practice, scientists must choose conservative configurations so that accuracy is always acceptable, forgoing possible acceleration. As an alternative, we propose Proxima, a systematic and automated method for dynamically tuning a surrogate-modeling configuration in response to real-time feedback from the ongoing simulation. Proxima estimates the uncertainty of applying a surrogate approximation in each step of an iterative simulation. Using this information, the specific surrogate configuration can be adjusted dynamically to ensure maximum speedup while sustaining a required accuracy metric. We evaluate Proxima using a Monte Carlo sampling application and find that Proxima respects a wide range of user-defined accuracy goals while achieving speedups of 1.02--5.5X relative to a standard
{"title":"Proxima","authors":"Yuliana Zamora, Logan T. Ward, G. Sivaraman, I. Foster, H. Hoffmann","doi":"10.1145/3447818.3460370","DOIUrl":"https://doi.org/10.1145/3447818.3460370","url":null,"abstract":"Atomistic-scale simulations are prominent scientific applications that require the repetitive execution of a computationally expensive routine to calculate a system's potential energy. Prior work shows that these expensive routines can be replaced with a machine-learned surrogate approximation to accelerate the simulation at the expense of the overall accuracy. The exact balance of speed and accuracy depends on the specific configuration of the surrogate-modeling workflow and the science itself, and prior work leaves it up to the scientist to find a configuration that delivers the required accuracy for their science problem. Unfortunately, due to the underlying system dynamics, it is rare that a single surrogate configuration presents an optimal accuracy/latency trade-off for the entire simulation. In practice, scientists must choose conservative configurations so that accuracy is always acceptable, forgoing possible acceleration. As an alternative, we propose Proxima, a systematic and automated method for dynamically tuning a surrogate-modeling configuration in response to real-time feedback from the ongoing simulation. Proxima estimates the uncertainty of applying a surrogate approximation in each step of an iterative simulation. Using this information, the specific surrogate configuration can be adjusted dynamically to ensure maximum speedup while sustaining a required accuracy metric. We evaluate Proxima using a Monte Carlo sampling application and find that Proxima respects a wide range of user-defined accuracy goals while achieving speedups of 1.02--5.5X relative to a standard","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80731926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xin Zhao, Jin Zhou, Hui Guan, Wei Wang, Xu Liu, Tongping Liu
It is extremely challenging to achieve optimal performance of parallel applications on a NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share similar shortcomings in portability, effectiveness, and helpfulness. This paper proposes a novel profiling tool, NumaPerf, that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only for the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads instead of actual remote accesses. NumaPerf further detects potential thread migration and load imbalance issues that can significantly affect performance but are omitted by existing profilers. NumaPerf also identifies cache coherence issues separately, as they may require different fix strategies. Based on our extensive evaluation, NumaPerf identifies more performance issues than any existing tool, and fixing them leads to significant performance speedups.
{"title":"NumaPerf","authors":"Xin Zhao, Jin Zhou, Hui Guan, Wei Wang, Xu Liu, Tongping Liu","doi":"10.1145/3447818.3460361","DOIUrl":"https://doi.org/10.1145/3447818.3460361","url":null,"abstract":"It is extremely challenging to achieve optimal performance of parallel applications on a NUMA architecture, which necessitates the assistance of profiling tools. However, existing NUMA-profiling tools share some similar shortcomings, such as portability, effectiveness, and helpfulness issues. This paper proposes a novel profiling tool–NumaPerf–that overcomes these issues. NumaPerf aims to identify potential performance issues for any NUMA architecture, instead of only on the current hardware. To achieve this, NumaPerf focuses on memory sharing patterns between threads, instead of real remote accesses. NumaPerf further detects potential thread migrations and load imbalance issues that could significantly affect the performance but are omitted by existing profilers. NumaPerf also identifies cache coherence issues separately that may require different fix strategies. Based on our extensive evaluation, NumaPerf can identify more performance issues than any existing tool, while fixing them leads to significant performance speedup.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81273583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current computer systems are vulnerable to a wide range of attacks due to the proliferation of accelerators and the fact that current systems comprise multiple SoCs provided by different vendors. Thus, major processor vendors are moving towards limiting the trust boundary to the processor chip only, as in Intel's SGX, AMD's SME, and ARM's TrustZone. This secure boundary limitation requires protecting the memory content against data remanence attacks, which were performed against DRAM in the form of cold-boot attacks and are even more successful against NVM due to NVM's data persistency. However, implementing secure memory features such as memory encryption and integrity verification incurs a non-trivial performance overhead and can significantly reduce the emerging NVM's expected lifetime. Previous work looked at reducing the overheads of secure memory by packing more counters into a cache line, increasing the cacheability of security metadata, slightly reducing the size of the integrity tree, or using the ECC chip to store the MAC values. However, the root update process, which requires a sequential update of the MAC values in all the integrity tree levels, is barely studied. In this paper, we propose ProMT, a novel memory controller design that ensures a persistently secure system with minimal overheads. ProMT protects data confidentiality and ensures data integrity with minimal overheads. ProMT reduces the performance overhead of secure memory to 11.7%, extends the NVM's lifetime by 3.59x, and enables system recovery in a fraction of a second.
{"title":"ProMT","authors":"Mazen Alwadi, Aziz Mohaisen, Amr Awad","doi":"10.1145/3447818.3460377","DOIUrl":"https://doi.org/10.1145/3447818.3460377","url":null,"abstract":"Current computer systems are vulnerable to a wide range of attacks caused by the proliferation of accelerators, and the fact that current system comprise multiple SoCs provided from different vendors. Thus, major processor vendors are moving towards limiting the trust boundary to the processor chip only as in Intel's SGX, AMD's SME, and ARM's TrustZone. This secure boundary limitation requires protecting the memory content against data remanence attacks, which were performed against DRAM in the form of cold-boot attack and are more successful against NVM due to NVM's data persistency feature. However, implementing secure memory features, such as memory encryption and integrity verification has a non-trivial performance overhead, and can significantly reduce the emerging NVM's expected lifetime. Previous work looked at reducing the overheads of the secure memory implementation by packing more counters into a cache line, increasing the cacheability of security metadata, slightly reducing the size of the integrity tree, or using the ECC chip to store the MAC values. However, the root update process is barely studied, which requires a sequential update of the MAC values in all the integrity tree levels. In this paper, we propose ProMT, a novel memory controller design that ensures a persistently secure system with minimal overheads. ProMT protects the data confidentiality and ensures the data integrity with minimal overheads. ProMT reduces the performance overhead of secure memory implementation to 11.7%, extends the NVM's life time by 3.59x, and enables the system recovery in a fraction of a second.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80292786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quantum computing has shown strong potential in solving certain important problems. Due to the intrinsic limitations of current real quantum computers, quantum circuit simulation still plays an important role in both research and development of quantum computing. GPU-based quantum circuit simulation has been explored due to the GPU's high computation capability. Despite previous efforts, existing quantum circuit simulation systems usually rely on a single method to improve the poor data locality caused by complex quantum entanglement. However, we observe that existing simulation methods show significantly different performance for different circuit patterns; optimal performance cannot be obtained with any single method alone. To address these challenges, we propose HyQuas, a Hybrid partitioner-based Quantum circuit Simulation system on GPU, which can automatically select the suitable simulation method for different parts of a given quantum circuit according to its pattern. Moreover, to better support HyQuas, we also propose two highly optimized methods, OShareMem and TransMM, as optional choices of HyQuas. We further propose a GPU-centric communication pipelining approach for efficient distributed simulation. Experimental results show that HyQuas achieves up to 10.71x speedup on a single GPU and 227x speedup on a GPU cluster over state-of-the-art quantum circuit simulation systems.
{"title":"HyQuas","authors":"Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, Jidong Zhai","doi":"10.1145/3447818.3460357","DOIUrl":"https://doi.org/10.1145/3447818.3460357","url":null,"abstract":"Quantum computing has shown its strong potential in solving certain important problems. Due to the intrinsic limitations of current real quantum computers, quantum circuit simulation still plays an important role in both research and development of quantum computing. GPU-based quantum circuit simulation has been explored due to GPU's high computation capability. Despite previous efforts, existing quantum circuit simulation systems usually rely on a single method to improve poor data locality caused by complex quantum entanglement. However, we observe that existing simulation methods show significantly different performance for different circuit patterns. The optimal performance cannot be obtained only with any single method. To address these challenges, we propose HyQuas, a textbf{Hy}brid partitioner based textbf{Qua}ntum circuit textbf{S}imulation system on GPU, which can automatically select the suitable simulation method for different parts of a given quantum circuit according to its pattern. Moreover, to make better support for HyQuas, we also propose two highly optimized methods, OShareMem and TransMM, as optional choices of HyQuas. We further propose a GPU-centric communication pipelining approach for effective distributed simulation. Experimental results show that HyQuas can achieve up to 10.71 x speedup on a single GPU and 227 x speedup on a GPU cluster over state-of-the-art quantum circuit simulation systems.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75151567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Archit Patke, Saurabh Jha, Haoran Qiu, J. Brandt, A. Gentile, Joe Greenseid, Z. Kalbarczyk, R. Iyer
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. The delay sensitivity of an application quantifies the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism that selectively throttles applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, using common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9. Moreover, Netscope reduces the increase in application tail runtime by up to 16.3x while improving median system utility by 12%.
{"title":"Delay sensitivity-driven congestion mitigation for HPC systems","authors":"Archit Patke, Saurabh Jha, Haoran Qiu, J. Brandt, A. Gentile, Joe Greenseid, Z. Kalbarczyk, R. Iyer","doi":"10.1145/3447818.3460362","DOIUrl":"https://doi.org/10.1145/3447818.3460362","url":null,"abstract":"Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. Delay sensitivity of an application is used to quantify the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"116 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88108882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó
Many applications employ irregular and sparse memory accesses that cannot take advantage of existing cache hierarchies in high-performance processors. To solve this problem, Data Layout Transformation (DLT) techniques rearrange sparse data into a dense representation, improving locality and cache utilization. However, prior proposals in this space fail to provide a design that (i) scales with multi-core systems, (ii) hides rearrangement latency, and (iii) provides the necessary interfaces to ease programmability. In this work we present PLANAR, a programmable near-memory accelerator that rearranges sparse data into a dense representation. By placing PLANAR devices at the memory controller level we enable a design that scales well with multi-core systems, hides operation latency by performing non-blocking, fine-grained data rearrangements, and eases programmability by supporting virtual memory and conventional memory allocation mechanisms. Our evaluation shows that PLANAR leads to significant reductions in data movement and dynamic energy, providing an average 4.58× speedup.
{"title":"PLANAR: a programmable accelerator for near-memory data rearrangement","authors":"Adrián Barredo, Adrià Armejach, J. Beard, Miquel Moretó","doi":"10.1145/3447818.3460368","DOIUrl":"https://doi.org/10.1145/3447818.3460368","url":null,"abstract":"Many applications employ irregular and sparse memory accesses that cannot take advantage of existing cache hierarchies in high performance processors. To solve this problem, Data Layout Transformation (DLT) techniques rearrange sparse data into a dense representation, improving locality and cache utilization. However, prior proposals in this space fail to provide a design that (i) scales with multi-core systems, (ii) hides rearrangement latency, and (iii) provides the necessary interfaces to ease programmability. In this work we present PLANAR, a programmable near-memory accelerator that rearranges sparse data into dense. By placing PLANAR devices at the memory controller level we enable a design that scales well with multi-core systems, hides operation latency by performing non-blocking fine-grain data rearrangements, and eases programmability by supporting virtual memory and conventional memory allocation mechanisms. Our evaluation shows that PLANAR leads to significant reductions in data movement and dynamic energy, providing an average 4.58× speedup.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86872744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concolic testing combines concrete execution with symbolic execution along the executed path to automatically generate new test inputs that exercise program paths and deliver high code coverage during testing. The GKLEE tool uses this approach to expose data races in CUDA programs written for execution on GPGPUs. In programs employing concurrent dynamic data structures, automatically generating data structures with appropriate shapes that cause threads to follow selected, possibly divergent, paths is a challenge. Moreover, a single non-conflicting data structure must be generated for multiple threads; that is, a single shape must be found that simultaneously causes all threads to follow their respective chosen paths. When an execution exposes a bug (e.g., a data race), the generated data structure shape helps the programmer understand the cause of the bug. Because GKLEE does not permit pointers that construct dynamic data structures to be made symbolic, it cannot automatically generate data structures of different shapes and must rely on the user to write code that constructs them to exercise desired paths. We have developed DSGEN for automatically generating non-conflicting dynamic data structures with different shapes, and we have integrated it with GKLEE to uncover and facilitate understanding of data races in programs that employ complex concurrent dynamic data structures. Compared to GKLEE, DSGEN increases the number of races detected from 10 to 25 by automatically generating a total of 1,897 shapes in implementations of four complex concurrent dynamic data structures: B-Tree, Hash-Array Mapped Trie, RRB-Tree, and Skip List.
{"title":"DSGEN","authors":"Xiaofan Sun, Rajiv Gupta","doi":"10.1145/3447818.3460962","DOIUrl":"https://doi.org/10.1145/3447818.3460962","url":null,"abstract":"Concolic testing combines concrete execution with symbolic execution along the executed path to automatically generate new test inputs that exercise program paths and deliver high code coverage during testing. The GKLEE tool uses this approach to expose data races in CUDA programs written for execution of GPGPUs. In programs employing concurrent dynamic data structures, automatic generation of data structures with appropriate shapes that cause threads to follow selected, possibly divergent, paths is a challenge. Moreover, a single non-conflicting data structure must be generated for multiple threads, that is, a single shape must be found that simultaneously causes all threads to follow their respective chosen paths. When an execution exposes a bug (e.g., a data race), the generated data structure shape helps the programmer understand the cause of the bug. Because GKLEE does not permit pointers that construct dynamic data structures to be made symbolic, it cannot automatically generate data structures of different shapes and must rely on the user to write code that constructs them to exercise desired paths. We have developed DSGEN for automatically generating non-conflicting dynamic data structures with different shapes and integrated it with GKLEE to uncover and facilitate understanding of data races in programs that employ complex concurrent dynamic data structures. In comparison to GKLEE, DSGEN increases the number of races detected from 10 to 25 by automatically generating a total of 1,897 shapes in implementations of four complex concurrent dynamic data structures -- B-Tree, Hash-Array Mapped Trie, RRB-Tree, and Skip List.","PeriodicalId":73273,"journal":{"name":"ICS ... : proceedings of the ... ACM International Conference on Supercomputing. International Conference on Supercomputing","volume":"120 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75884086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}