Performance-detective: automatic deduction of cheap and accurate performance models
Larissa Schmid, Marcin Copik, A. Calotoiu, Dominik Werle, Andreas Reiter, M. Selzer, A. Koziolek, T. Hoefler
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532391
The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance modeling approaches for applications and configurable systems require either a full-factorial experiment design or a sampling design based on heuristics, resulting in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use these insights to derive the smallest necessary experiment design and to avoid repeated measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective using two case studies in which we reduce the number of measurements from up to 3125 to only 25, decreasing the cost to only 2.9% of the previously needed core hours while maintaining an accuracy of 91.5% for the resulting model, compared to 93.8% when using all 3125 measurements.
GAPS
Bagus Hanindhito, Dimitrios Gourounas, Arash Fathi, Dimitar Trenev, A. Gerstlauer, L. John
Pub Date: 2022-06-28. DOI: 10.1007/3-540-29623-9_179
Dense dynamic blocks: optimizing SpMM for processors with vector and matrix units using machine learning techniques
Serif Yesil, J. Moreira, J. Torrellas
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532369
Recent processors have been augmented with matrix-multiply units that operate on small matrices, creating a functional unit-rich environment. These units have been successfully employed on dense matrix operations such as those found in the Basic Linear Algebra Subprograms (BLAS). In this work, we exploit these new matrix-multiply facilities to speed up Sparse Matrix Dense Matrix Multiplications (SpMM) for highly sparse matrices. SpMM is hard to optimize. The sparsity patterns lead to a highly irregular memory access behavior. Additionally, each sparse matrix has unique characteristics, making it hard to find a single SpMM strategy that works well for all sparse matrices. The addition of matrix-multiply units makes this even more challenging. In this paper, we address these challenges. First, we design Dense Dynamic Blocks (DDB), a method to utilize the new matrix units. DDB has two specialized versions: DDB-MM and DDB-HYB. DDB-MM is a strategy that only utilizes the matrix-multiply facilities. DDB-HYB is a hybrid approach that maximizes the floating-point throughput by utilizing both vector and matrix units. Furthermore, we design a prediction mechanism for identifying the best SpMM strategy for a given sparse matrix and dense matrix pair: SpMM-OPT. SpMM-OPT selects among vector unit oriented, matrix unit oriented, and hybrid strategies for the highest floating-point throughput while taking cache optimizations into account. We experiment with 440 matrices from the well-known SuiteSparse matrix collection on a POWER10 system with vector and matrix units. We show that DDB-MM and DDB-HYB can achieve a floating-point throughput of up to 1.1 and 2.5 TFLOPs/s on a POWER10 single-chip module for double- and single-precision SpMM, respectively. Our analysis also shows that SpMM-OPT effectively chooses the best SpMM strategy and can achieve an average speedup of up to 2X compared to an optimized CSR baseline.
ASAP: automatic synthesis of area-efficient and precision-aware CGRAs
Cheng Tan, Thierry Tambe, J. Zhang, B. Fang, Tong Geng, Gu-Yeon Wei, D. Brooks, Antonino Tumeo, G. Gopalakrishnan, Ang Li
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532359
Coarse-grained reconfigurable accelerators (CGRAs) are a promising accelerator design choice that strikes a balance between performance and adaptability to different computing patterns across various application domains. Designing a CGRA for a specific application domain involves enormous software/hardware engineering effort. Recent research works explore loop transformations, functional unit types, network topology, and memory size to identify optimal CGRA designs given a set of kernels from a specific application domain. Unfortunately, the impact of functional units with different precision support has rarely been investigated. To address this gap, we propose ASAP, a hardware/software co-design framework that automatically identifies and synthesizes an optimal precision-aware CGRA for a set of applications of interest. Our evaluation shows that ASAP generates specialized designs 3.2X, 4.21X, and 5.8X more efficient (in terms of performance per unit of energy or area) than non-specialized homogeneous CGRAs, for the scientific computing, embedded, and edge machine learning domains, respectively, with limited accuracy loss. Moreover, ASAP provides more efficient designs than other state-of-the-art synthesis frameworks for specialized CGRAs.
LITE
Ardhi Wiratama Baskara Yudha, J. Meyer, Shougang Yuan, Huiyang Zhou, Yan Solihin
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532361
There is a strong need for GPU trusted execution environments (TEEs) as GPUs are increasingly used in cloud environments. However, current proposals either ignore memory security (i.e., they do not encrypt memory) or impose a memory encryption domain separate from the host TEE, causing a very substantial slowdown when communicating data to and from the host. In this paper, we propose a flexible GPU memory encryption design called LITE that relies on software memory encryption aided by a small amount of architectural support. LITE's flexibility allows the GPU TEE to be co-designed with the CPU to create a unified encryption domain. We show that GPU applications can be adapted to use LITE's encryption APIs without major changes. Through various optimizations, we show that software memory encryption in LITE can produce negligible performance overheads (1.1%) for regular benchmarks and still-acceptable overheads (56%) for irregular benchmarks.
Low overhead and context sensitive profiling of GPU-accelerated applications
K. Zhou, Jonathon M. Anderson, Xiaozhu Meng, J. Mellor-Crummey
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532388
As we near the end of Moore's law scaling, next-generation computing platforms are increasingly exploring heterogeneous processors for acceleration. Graphics Processing Units (GPUs) are the most widely used accelerators. Meanwhile, applications are evolving by adopting new programming models and algorithms for emerging platforms. To harness the full power of GPUs, performance tools serve a critical role in understanding and tuning application performance, especially for applications whose complex executions span both the CPU and the GPU. To help developers analyze and tune applications, performance tools need to associate performance metrics with calling contexts. However, existing performance tools incur high overhead when collecting performance metrics and attributing them to full calling contexts. To address the problem, we developed a tool that constructs both CPU and GPU calling contexts with low overhead and high accuracy. With an innovative call path memoization mechanism, our tool can obtain call paths for GPU operations with negligible cost. For GPU calling contexts, our tool uses an adaptive epoch profiling method to collect GPU instruction samples with reduced synchronization cost and reconstructs the calling contexts using postmortem analysis. We have evaluated our tool on nine HPC and machine learning applications on a machine equipped with an NVIDIA GPU. Compared with the state-of-the-art GPU profilers, our tool reduces the overhead for coarse-grained profiling of GPU operations from 2.07X to 1.42X and the overhead for fine-grained profiling of GPU instructions from 27.51X to 4.61X, with accuracies of 99.93% and 96.16% in the two modes, respectively.
Cloak: tolerating non-volatile cache read latency
Apostolos Kokolis, Namrata Mantri, Shrikanth Ganapathy, J. Torrellas, J. Kalamatianos
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532381
The increased memory demands of workloads are putting high pressure on Last Level Caches (LLCs). In general, there is limited opportunity to increase the capacity of LLCs due to the area and power requirements of the underlying SRAM technology. Interestingly, emerging Non-Volatile Memory (NVM) technologies promise a feasible alternative to SRAM for LLCs due to their higher area density. However, NVMs have substantially higher read and write latencies, which offset their density benefit. Although researchers have proposed methods to tolerate NVM's higher write latency, little emphasis has been placed on the critical NVM read latency. To address this problem, this paper proposes Cloak. Cloak exploits page-level data reuse in the LLC to hide NVM read latency. Specifically, on certain L1 DTLB misses, Cloak transfers LLC-resident data belonging to the TLB-missing page from the LLC NVM array to a set of small SRAM Page Buffers that will service subsequent requests to this page. Further, to enable the high-bandwidth, low-latency transfer of lines of a page to the page buffers, Cloak uses an LLC layout that accelerates the discovery of LLC-resident cache lines from the page. We evaluate Cloak with full-system simulations of a 4-core processor across 14 workloads. We find that, on average, a machine with Cloak is faster than one with an SRAM LLC by 23.8% and one with an NVM-only LLC by 8.9%, in both cases with a negligible change in area. Further, Cloak reduces the ED2 metric relative to these designs by 39.9% and 17.5%, respectively.
VICO
Sharjeel Khan, Bodhisatwa Chatterjee, S. Pande
Pub Date: 2022-06-28. DOI: 10.1093/nq/s6-xii.293.118d
Seamless optimization of the GEMM kernel for task-based programming models
A. Lorenzon, Sandro M. Marques, Antoni C. Navarro, Vicenç Beltran
Pub Date: 2022-06-28. DOI: 10.1145/3524059.3532385
The general matrix-matrix multiplication (GEMM) kernel is a fundamental building block of many scientific applications. Many libraries such as Intel MKL and BLIS provide highly optimized sequential and parallel versions of this kernel. The parallel implementations of the GEMM kernel rely on the well-known fork-join execution model to exploit multi-core systems efficiently. However, these implementations are not well suited for task-based applications as they break the data-flow execution model. In this paper, we present a task-based implementation of the GEMM kernel that can be seamlessly leveraged by task-based applications while providing better performance than the fork-join version. Our implementation leverages several advanced features of the OmpSs-2 programming model and a new heuristic to select the best parallelization strategy and blocking parameters based on the matrix and hardware characteristics. When evaluating the performance and energy consumption on two modern multi-core systems, we show that our implementation provides significant performance improvements over an optimized OpenMP fork-join implementation and can beat vendor GEMM implementations (e.g., Intel MKL and AMD AOCL). We also demonstrate that a real application can leverage our optimized task-based implementation to enhance performance.