Performance-detective: automatic deduction of cheap and accurate performance models
Larissa Schmid, Marcin Copik, A. Calotoiu, Dominik Werle, Andreas Reiter, M. Selzer, A. Koziolek, T. Hoefler
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532391
The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance modeling approaches for applications and configurable systems require either a full-factorial experiment design or a sampling design based on heuristics, resulting in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use these insights to derive the smallest necessary experiment design and to avoid repeated measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective using two case studies in which we reduce the number of measurements from up to 3125 to only 25, decreasing the cost to only 2.9% of the previously needed core hours while maintaining an accuracy of 91.5% for the resulting model, compared to 93.8% when using all 3125 measurements.
GAPS
Bagus Hanindhito, Dimitrios Gourounas, Arash Fathi, Dimitar Trenev, A. Gerstlauer, L. John
Pub Date: 2022-06-28. DOI: 10.1007/3-540-29623-9_179
Dense dynamic blocks: optimizing SpMM for processors with vector and matrix units using machine learning techniques
Serif Yesil, J. Moreira, J. Torrellas
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532369
Recent processors have been augmented with matrix-multiply units that operate on small matrices, creating a functional unit-rich environment. These units have been successfully employed on dense matrix operations such as those found in the Basic Linear Algebra Subprograms (BLAS). In this work, we exploit these new matrix-multiply facilities to speed up Sparse Matrix Dense Matrix Multiplications (SpMM) for highly sparse matrices. SpMM is hard to optimize. The sparsity patterns lead to a highly irregular memory access behavior. Additionally, each sparse matrix has unique characteristics, making it hard to find a single SpMM strategy that works well for all sparse matrices. The addition of matrix-multiply units makes this even more challenging. In this paper, we address these challenges. First, we design Dense Dynamic Blocks (DDB), a method to utilize the new matrix units. DDB has two specialized versions: DDB-MM and DDB-HYB. DDB-MM is a strategy that only utilizes the matrix-multiply facilities. DDB-HYB is a hybrid approach that maximizes the floating-point throughput by utilizing both vector and matrix units. Furthermore, we design a prediction mechanism for identifying the best SpMM strategy for a given sparse matrix and dense matrix pair: SpMM-OPT. SpMM-OPT selects among vector unit oriented, matrix unit oriented, and hybrid strategies for the highest floating-point throughput while taking cache optimizations into account. We experiment with 440 matrices from the well-known SuiteSparse matrix collection on a POWER10 system with vector and matrix units. We show that DDB-MM and DDB-HYB can achieve a floating-point throughput of up to 1.1 and 2.5 TFLOPs/s on a POWER10 single-chip module for double- and single-precision SpMM, respectively. Our analysis also shows that SpMM-OPT effectively chooses the best SpMM strategy and can achieve an average speedup of up to 2X compared to an optimized CSR baseline.
ASAP: automatic synthesis of area-efficient and precision-aware CGRAs
Cheng Tan, Thierry Tambe, J. Zhang, B. Fang, Tong Geng, Gu-Yeon Wei, D. Brooks, Antonino Tumeo, G. Gopalakrishnan, Ang Li
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532359
Coarse-grained reconfigurable accelerators (CGRAs) are a promising accelerator design choice that strikes a balance between performance and adaptability to different computing patterns across various application domains. Designing a CGRA for a specific application domain involves enormous software/hardware engineering effort. Recent research works explore loop transformations, functional unit types, network topology, and memory size to identify optimal CGRA designs given a set of kernels from a specific application domain. Unfortunately, the impact of functional units with different precision support has rarely been investigated. To address this gap, we propose ASAP, a hardware/software co-design framework that automatically identifies and synthesizes an optimal precision-aware CGRA for a set of applications of interest. Our evaluation shows that ASAP generates specialized designs 3.2X, 4.21X, and 5.8X more efficient (in terms of performance per unit of energy or area) than non-specialized homogeneous CGRAs, for the scientific computing, embedded, and edge machine learning domains, respectively, with limited accuracy loss. Moreover, ASAP provides more efficient designs than other state-of-the-art synthesis frameworks for specialized CGRAs.
LITE
Ardhi Wiratama Baskara Yudha, J. Meyer, Shougang Yuan, Huiyang Zhou, Yan Solihin
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532361
There is a strong need for GPU trusted execution environments (TEEs) as GPUs are increasingly used in cloud environments. However, current proposals either ignore memory security (i.e., they do not encrypt memory) or impose a memory encryption domain separate from the host TEE, causing a very substantial slowdown when communicating data to and from the host. In this paper, we propose a flexible GPU memory encryption design called LITE that relies on software memory encryption aided by a small amount of architectural support. LITE's flexibility allows the GPU TEE to be co-designed with the CPU to create a unified encryption domain. We show that GPU applications can be adapted to use LITE's encryption APIs without major changes. Through various optimizations, we show that software memory encryption in LITE can produce negligible performance overheads (1.1%) for regular benchmarks and still-acceptable overheads (56%) for irregular benchmarks.
Low overhead and context sensitive profiling of GPU-accelerated applications
K. Zhou, Jonathon M. Anderson, Xiaozhu Meng, J. Mellor-Crummey
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532388
As we near the end of Moore's law scaling, next-generation computing platforms are increasingly exploring heterogeneous processors for acceleration. Graphics Processing Units (GPUs) are the most widely used accelerators. Meanwhile, applications are evolving by adopting new programming models and algorithms for emerging platforms. To harness the full power of GPUs, performance tools serve a critical role in understanding and tuning application performance, especially for applications whose complex executions span both the CPU and the GPU. To help developers analyze and tune applications, performance tools need to associate performance metrics with calling contexts. However, existing performance tools incur high overhead when collecting performance metrics and attributing them to full calling contexts. To address the problem, we developed a tool that constructs both CPU and GPU calling contexts with low overhead and high accuracy. With an innovative call path memoization mechanism, our tool can obtain call paths for GPU operations with negligible cost. For GPU calling contexts, our tool uses an adaptive epoch profiling method to collect GPU instruction samples with reduced synchronization cost and reconstructs the calling contexts using postmortem analysis. We have evaluated our tool on nine HPC and machine learning applications on a machine equipped with an NVIDIA GPU. Compared with the state-of-the-art GPU profilers, our tool reduces the overhead for coarse-grained profiling of GPU operations from 2.07X to 1.42X and the overhead for fine-grained profiling of GPU instructions from 27.51X to 4.61X, with accuracies of 99.93% and 96.16% in the two modes, respectively.
Cloak: tolerating non-volatile cache read latency
Apostolos Kokolis, Namrata Mantri, Shrikanth Ganapathy, J. Torrellas, J. Kalamatianos
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532381
The increased memory demands of workloads are putting high pressure on Last Level Caches (LLCs). In general, there is limited opportunity to increase the capacity of LLCs due to the area and power requirements of the underlying SRAM technology. Interestingly, emerging Non-Volatile Memory (NVM) technologies promise a feasible alternative to SRAM for LLCs due to their higher area density. However, NVMs have substantially higher read and write latencies, which offset their density benefit. Although researchers have proposed methods to tolerate NVM's higher write latency, little emphasis has been placed on the critical NVM read latency. To address this problem, this paper proposes Cloak. Cloak exploits page-level data reuse in the LLC to hide NVM read latency. Specifically, on certain L1 DTLB misses, Cloak transfers LLC-resident data belonging to the TLB-missing page from the LLC NVM array to a set of small SRAM Page Buffers that will service subsequent requests to this page. Further, to enable the high-bandwidth, low-latency transfer of lines of a page to the page buffers, Cloak uses an LLC layout that accelerates the discovery of LLC-resident cache lines from the page. We evaluate Cloak with full-system simulations of a 4-core processor across 14 workloads. We find that, on average, a machine with Cloak is faster than one with an SRAM LLC by 23.8% and one with an NVM-only LLC by 8.9%, in both cases with a negligible change in area. Further, Cloak reduces the ED2 metric relative to these designs by 39.9% and 17.5%, respectively.
VICO
Sharjeel Khan, Bodhisatwa Chatterjee, S. Pande
Pub Date: 2022-06-28. DOI: 10.1093/nq/s6-xii.293.118d
Seamless optimization of the GEMM kernel for task-based programming models
A. Lorenzon, Sandro M. Marques, Antoni C. Navarro, Vicenç Beltran
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532385
The general matrix-matrix multiplication (GEMM) kernel is a fundamental building block of many scientific applications. Many libraries such as Intel MKL and BLIS provide highly optimized sequential and parallel versions of this kernel. The parallel implementations of the GEMM kernel rely on the well-known fork-join execution model to exploit multi-core systems efficiently. However, these implementations are not well suited for task-based applications as they break the data-flow execution model. In this paper, we present a task-based implementation of the GEMM kernel that can be seamlessly leveraged by task-based applications while providing better performance than the fork-join version. Our implementation leverages several advanced features of the OmpSs-2 programming model and a new heuristic to select the best parallelization strategy and blocking parameters based on the matrix and hardware characteristics. When evaluating the performance and energy consumption on two modern multi-core systems, we show that our implementation provides significant performance improvements over an optimized OpenMP fork-join implementation and can beat vendor GEMM implementations (e.g., Intel MKL and AMD AOCL). We also demonstrate that a real application can leverage our optimized task-based implementation to enhance performance.