Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs
Jinsung Kim, Aravind Sukumaran-Rajam, Changwan Hong, Ajay Panyala, Rohit Kumar Srivastava, S. Krishnamoorthy, P. Sadayappan
Tensor contractions are higher-dimensional analogs of matrix multiplications, used in many computational contexts such as higher-order models in quantum chemistry, deep learning, and finite element methods. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. The challenges in optimizing tensor contractions that arise in practice from the variety of tensor dimensionalities and shapes include the effective mapping of the high-dimensional iteration space to threads, the choice of data buffering in shared memory and registers, and the tile sizes for multi-level tiling. Furthermore, for the symmetrized tensor contractions in CCSD(T), it is also a challenge to fuse contractions so as to reduce data-movement cost by exploiting reuse of intermediate tensors. In this paper, we develop an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion, and register transpose. Experimental results demonstrate significant improvement over the current state of the art.
{"title":"Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs","authors":"Jinsung Kim, Aravind Sukumaran-Rajam, Changwan Hong, Ajay Panyala, Rohit Kumar Srivastava, S. Krishnamoorthy, P. Sadayappan","doi":"10.1145/3205289.3205296","DOIUrl":"https://doi.org/10.1145/3205289.3205296","url":null,"abstract":"Tensor contractions are higher dimensional analogs of matrix multiplications, used in many computational contexts such as high order models in quantum chemistry, deep learning, finite element methods etc. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. Some of the challenges in optimizing tensor contractions that arise in practice from the variety of dimensionalities and shapes for tensors include effective mapping of the high-dimensional iteration space to threads, choice of data buffering in shared-memory and registers, and tile sizes for multi-level tiling. Furthermore, in the case of symmetrized tensor contractions in CCSD(T), it is also a challenge to fuse contractions to reduce data movement cost by exploiting reuse of intermediate tensors. In this paper, we develop an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose. Experimental results demonstrate significant improvement over the current state-of-the-art.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125412039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A two-phase recovery mechanism
Zhaoxiang Jin, Soner Önder
Superscalar processors take advantage of speculative execution to improve performance. When the speculation turns out to be incorrect, a recovery procedure is initiated. The back-end of the processor cannot simply be flushed, because it holds a mixture of valid and invalid instructions. A basic solution is to wait for all valid instructions to retire and then purge the invalid instructions. However, if a long-latency operation, such as a last-level cache (LLC) miss, appears before the misspeculation point, the back-end recovery time increases significantly. Many proposed mechanisms selectively flush invalid instructions in order to speed up back-end recovery. In general, these mechanisms rely on broadcasting misprediction-related tags to remove the instructions from back-end structures such as the ROB, LSQ, and reservation stations. The hardware overhead of these mechanisms is nontrivial and can potentially affect the processor clock cycle time if they lie on the critical path. Moreover, a checkpointing mechanism or a walker must be added to accelerate the recovery of the front-end register alias table (F-RAT). We propose a two-phase recovery mechanism that needs no walking or broadcasting process and still matches the performance of state-of-the-art recovery approaches. The first phase works like a typical basic recovery mechanism, and the second phase is triggered only when the back-end is stalled by an LLC-miss load. In that case, the second phase treats the load as a misspeculation and recovers from that load. Since the LLC-miss response time is usually much longer than the time needed to fill the entire pipeline with new instructions, in most cases our mechanism can completely overlap the branch misprediction recovery penalty with the cache miss penalty.
{"title":"A two-phase recovery mechanism","authors":"Zhaoxiang Jin, Soner Önder","doi":"10.1145/3205289.3205300","DOIUrl":"https://doi.org/10.1145/3205289.3205300","url":null,"abstract":"Superscalar processors take advantage of speculative execution to improve performance. When the speculation turns out to be incorrect, a recovery procedure is initiated. The back-end of the processor cannot be flushed due to having a mixture of both valid and invalid instructions. A basic solution is to wait for all valid instructions to retire and then purge the invalid instructions. However, if a long latency operation, such as a Last-level Cache (LLC) miss appears before the misspeculation point, the back-end recovery time significantly increases. Many proposed mechanisms selectively flush invalid instructions in order to speed up the back-end recovery. In general, these mechanisms rely on broadcasting some misprediction related tags to remove the instructions from any backend structures, such as ROB, LSQ, RS, etc. The hardware overhead in these mechanisms is nontrivial and can potentially affect the processor clock cycle time if they are on the critical path. Moreover, a checkpointing mechanism or a walker needs to be added to accelerate the recovery of the front-end register alias table (F-RAT). We propose a two-phase recovery mechanism which does not need any walking or broadcasting process and can still match the performance of the state-of-the-art recovery approaches. The first phase works similar to a typical basic recovery mechanism and the second phase is not triggered until the backend is stalled by an LLC miss load. In that case, the second phase treats the load as a misspeculation and recovers from this load. Since the LLC miss response time is usually much longer than the time to fill the entire pipeline with new instructions, in most cases our mechanism can completely overlap the branch misprediction recovery penalty with the cache miss penalty.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124048356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Demystifying Cache Policies for Photo Stores at Scale: A Tencent Case Study
Ke Zhou, Si Sun, Hua Wang, Ping-Hsiu Huang, Xubin He, Rui Lan, Wenyan Li, Wenjie Liu, Tianming Yang
Photo service providers face critical challenges in dealing with huge amounts of photo storage, typically on the order of billions of photos, while ensuring satisfactory nation-wide or world-wide user experiences. Distributed photo caching architectures are widely deployed to meet high performance expectations, and efficient yet still poorly understood caching policies play an essential role in them. In this work, we present a comprehensive study of internet-scale photo caching algorithms in the case of QQPhoto from Tencent Inc., the largest social network service company in China. We reveal that even advanced cache algorithms only perform at a level similar to simple baseline algorithms, and that a large performance gap remains between these cache algorithms and the theoretically optimal algorithm, due to the complicated access behaviors in such a large multi-tenant environment. We then explain the reasons behind this phenomenon by extensively investigating the characteristics of QQPhoto workloads. Finally, to realistically further improve QQPhoto cache efficiency, we propose incorporating a prefetcher in the cache stack based on the observed immediacy feature that is unique to the QQPhoto workload. Evaluation results show that, with appropriate prefetching, we improve the cache hit ratio by up to 7.4%, while reducing the average access latency by 6.9% at a marginal cost of 4.14% more backend network traffic compared to the original system that performs no prefetching.
{"title":"Demystifying Cache Policies for Photo Stores at Scale: A Tencent Case Study","authors":"Ke Zhou, Si Sun, Hua Wang, Ping-Hsiu Huang, Xubin He, Rui Lan, Wenyan Li, Wenjie Liu, Tianming Yang","doi":"10.1145/3205289.3205299","DOIUrl":"https://doi.org/10.1145/3205289.3205299","url":null,"abstract":"Photo service providers are facing critical challenges of dealing with the huge amount of photo storage, typically in a magnitude of billions of photos, while ensuring national-wide or world-wide satisfactory user experiences. Distributed photo caching architecture is widely deployed to meet high performance expectations, where efficient still mysterious caching policies play essential roles. In this work, we present a comprehensive study on internet-scale photo caching algorithms in the case of QQPhoto from Tencent Inc., the largest social network service company in China. We unveil that even advanced cache algorithms can only perform at a similar level as simple baseline algorithms and there still exists a large performance gap between these cache algorithms and the theoretically optimal algorithm due to the complicated access behaviors in such a large multi-tenant environment. We then expound the behind reasons for that phenomenon via extensively investigating the characteristics of QQPhoto workloads. Finally, in order to realistically further improve QQPhoto cache efficiency, we propose to incorporate a prefetcher in the cache stack based on the observed immediacy feature that is unique to the QQPhoto workload. Evaluation results show that with appropriate prefetching we improve the cache hit ratio by up to 7.4%, while reducing the average access latency by 6.9% at a marginal cost of 4.14% backend network traffic compared to the original system that performs no prefetching.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122187986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ChplBlamer
Hui Zhang, Jeffrey K. Hollingsworth
Parallel programming is hard, and it is even harder to analyze parallel programs and identify specific performance bottlenecks. Chapel is an emerging Partitioned-Global-Address-Space (PGAS) language that provides productive parallel programming. Most established profilers either completely lack the capacity to profile Chapel programs or generate information that cannot provide insightful guidance in a user-level context. To address this issue, we developed ChplBlamer to pinpoint performance losses due to data distribution and remote data accesses. We use a combined data-centric and code-centric approach to help Chapel users quickly identify performance bottlenecks in the source. To demonstrate the utility of ChplBlamer, we studied three multi-locale Chapel benchmarks. For each benchmark, ChplBlamer found the causes of the performance losses. With the optimization guidance provided by ChplBlamer, we significantly improved performance, by up to 4x, with little code modification.
{"title":"ChplBlamer","authors":"Hui Zhang, Jeffrey K. Hollingsworth","doi":"10.1145/3205289.3205314","DOIUrl":"https://doi.org/10.1145/3205289.3205314","url":null,"abstract":"Parallel programming is hard, and it is even harder to analyze parallel programs and identify specific performance bottlenecks. Chapel is an emerging Partitioned-Global-Address-Space (PGAS) language that provides productive parallel programming. Most established profilers either completely lack the capacity to profile Chapel programs or generate information that cannot provide insightful guidance in a user-level context. To address this issue, we developed ChplBlamer to pinpoint performance losses due to data distribution and remote data accesses. We use a data-centric and code-centric combined approach to help Chapel users quickly identify performance bottlenecks in the source. To demonstrate the utility of ChplBlamer, we studied three multi-locale Chapel benchmarks. For each benchmark, ChplBlamer found the causes of the performance losses. With the optimization guidance provided by ChplBlamer, we significantly improved the performance by up to 4x with little code modification.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128260687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
B. Kerbl, Michael Kenzel, J. H. Mueller, D. Schmalstieg, M. Steinberger
Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We show that previous queuing approaches are unfit for this task, as they either (1) do not work well in a massively parallel environment, or (2) obstruct the use of individual threads on top of single-instruction-multiple-data (SIMD) cores, or (3) block during access, thus prohibiting multi-queue setups. With these issues in mind, we present the Broker Queue, a highly efficient, fully linearizable FIFO queue for fine-granular parallel work distribution on the GPU. We evaluate its performance and usability on modern GPU models against a wide range of existing algorithms. The Broker Queue is up to three orders of magnitude faster than nonblocking queues and can even outperform significantly simpler techniques that lack desired properties for fine-granular work distribution.
{"title":"The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU","authors":"B. Kerbl, Michael Kenzel, J. H. Mueller, D. Schmalstieg, M. Steinberger","doi":"10.1145/3205289.3205291","DOIUrl":"https://doi.org/10.1145/3205289.3205291","url":null,"abstract":"Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We show that previous queuing approaches are unfit for this task, as they either (1) do not work well in a massively parallel environment, or (2) obstruct the use of individual threads on top of single-instruction-multiple-data (SIMD) cores, or (3) block during access, thus prohibiting multi-queue setups. With these issues in mind, we present the Broker Queue, a highly efficient, fully linearizable FIFO queue for fine-granular parallel work distribution on the GPU. We evaluate its performance and usability on modern GPU models against a wide range of existing algorithms. The Broker Queue is up to three orders of magnitude faster than nonblocking queues and can even outperform significantly simpler techniques that lack desired properties for fine-granular work distribution.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121359414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs
Lluc Alvarez, Marc Casas, Jesús Labarta, E. Ayguadé, M. Valero, Miquel Moretó
Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization that parallelizes the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks and can exploit this information to avoid unprofitable copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% over the state-of-the-art library for managing the stacked DRAM and 29% over a stacked DRAM architected as a hardware cache.
{"title":"Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs","authors":"Lluc Alvarez, Marc Casas, Jesús Labarta, E. Ayguadé, M. Valero, Miquel Moretó","doi":"10.1145/3205289.3205312","DOIUrl":"https://doi.org/10.1145/3205289.3205312","url":null,"abstract":"Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving raise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support nor modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization to parallelize the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid unworthy copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% against the state-of-the-art library to manage the stacked DRAM and 29% against a stacked DRAM architected as a hardware cache.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128018884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 2018 International Conference on Supercomputing","authors":"","doi":"10.1145/3205289","DOIUrl":"https://doi.org/10.1145/3205289","url":null,"abstract":"","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133972295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phase-Aware Web Browser Power Management on HMP Platforms
N. Peters, Sangyoung Park, Daniel Clifford, S. Kyostila, R. McIlroy, B. Meurer, H. Payer, S. Chakraborty
Over the last few years, web browsing has been steadily shifting from desktop computers to mobile devices such as smartphones and tablets. However, the mobile browsers available today have mainly focused on performance rather than power consumption, although the battery life of a mobile device is one of its most important usability metrics. This is because many of these browsers originated in the desktop domain and were ported to the mobile domain. Such browsers have multiple power-hungry components, such as the rendering engine and the JavaScript engine, and generate a high workload without considering the capabilities and power consumption characteristics of the underlying hardware platform. Moreover, the lack of coordination between a browser application and the power manager in the operating system (such as Android) results in poor power savings. In this paper, we propose a power manager that takes into account the internal state of the browser -- which we refer to as a phase -- and show, with Google's Chrome running on Android, that up to 57.4% more energy can be saved over Android's default power managers. We implemented and evaluated our technique on a heterogeneous multiprocessing (HMP) ARM big.LITTLE platform like the ones found in most modern smartphones.
{"title":"Phase-Aware Web Browser Power Management on HMP Platforms","authors":"N. Peters, Sangyoung Park, Daniel Clifford, S. Kyostila, R. McIlroy, B. Meurer, H. Payer, S. Chakraborty","doi":"10.1145/3205289.3205293","DOIUrl":"https://doi.org/10.1145/3205289.3205293","url":null,"abstract":"Over the last years, web browsing has been steadily shifting from desktop computers to mobile devices like smartphones and tablets. However, mobile browsers available today have mainly focused on performance rather than power consumption, although the battery life of a mobile device is one of the most important usability metrics. This is because many of these browsers have originated in the desktop domain and have been ported to the mobile domain. Such browsers have multiple power hungry components such as the rendering engine, and the JavaScript engine, and generate high workload without considering the capabilities and the power consumption characteristics of the underlying hardware platform. Also, the lack of coordination between a browser application and the power manager in the operating system (such as Android) results in poor power savings. In this paper, we propose a power manager that takes into account the internal state of a browser -- that we refer to as a phase -- and show with Google's Chrome running on Android that up to 57.4% more energy can be saved over Android's default power managers. We implemented and evaluated our technique on a heterogeneous multiprocessing (HMP) ARM big.LITTLE platform such as the ones found in most modern smartphones.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133979846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting Loop Tiling for Datacenters: Live and Let Live
Jiacheng Zhao, Huimin Cui, Yalin Zhang, Jingling Xue, Xiaobing Feng
As DNNs gain popularity in modern datacenters, it becomes imperative to revisit compiler optimizations for DNNs in a colocation scenario. Loop tiling turns out to be the most significant compiler optimization, since DNNs typically apply a series of matrix computations iteratively to a massive amount of data. We introduce a reuse-pattern-centric approach to obtaining a peer-aware TSS (Tile Size Selection) model for a matrix-based application A. Our key insight is that the co-running cache behavior of A (once tiled) can be determined by its data reuse patterns, together with the cache pressure exerted by its co-running peers, without actually needing to analyze the code of its co-runners. Compared with static tiling (which determines a tile size for A statically, without considering its co-running peers), our peer-aware tiling enables compilers to generate either faster peer-aware efficient code for A (by optimizing the performance of A) or faster peer-aware nice code for A (by optimizing the performance of its co-runners). In addition, our peer-aware tiling enables library developers to improve the performance of library routines more effectively than static tiling does.
{"title":"Revisiting Loop Tiling for Datacenters: Live and Let Live","authors":"Jiacheng Zhao, Huimin Cui, Yalin Zhang, Jingling Xue, Xiaobing Feng","doi":"10.1145/3205289.3205306","DOIUrl":"https://doi.org/10.1145/3205289.3205306","url":null,"abstract":"As DNNs gain popularity in modern datacenters, it becomes imperative to revisit compiler optimizations for DNNs in a colocation scenario. Loop tiling turns out to be the most significant compiler optimization, since DNNs typically apply a series of matrix computations iteratively to a massive amount of data. We introduce a reuse-pattern-centric approach to obtaining a peer-aware TSS (Tile Size Selection) model for a matrix-based application A. Our key insight is that the co-running cache behavior of A (once tiled) can be determined by its data reuse patterns, together with the cache pressure exerted by its co-running peers, without actually the need for analyzing the code of its co-runners. Compared with static tiling (that determines a tile size for A statically without considering its co-running peers), our peer-aware tiling enables compilers to generate either faster peer-aware efficient code for A (by optimizing the performance of A) or faster peer-aware nice code for A (by optimizing the performance of its co-runners). In addition, our peer-aware tiling also enables library developers to improve the performance of library routines (more effectively than static tiling).","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133035299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs
Jacob Lambert, Seyong Lee, Jungwon Kim, J. Vetter, A. Malony
Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity and difficulties in optimizing performance. In this paper, we present a directive-based, high-level optimization framework for high-performance computing with FPGAs, built on top of an OpenACC-to-FPGA translation framework called OpenARC. We propose directive extensions and corresponding compile-time optimization techniques to enable the compiler to generate more efficient FPGA hardware configuration files. Empirical evaluation of the proposed framework on an Intel Stratix V with five OpenACC benchmarks from various application domains shows that FPGA-specific optimizations can lead to significant increases in performance across all tested applications. We also demonstrate that applying these high-level directive-based optimizations can allow OpenACC applications to perform similarly to lower-level OpenCL applications with hand-written FPGA-specific optimizations, and offer runtime and power performance benefits compared to CPUs and GPUs.
{"title":"Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs","authors":"Jacob Lambert, Seyong Lee, Jungwon Kim, J. Vetter, A. Malony","doi":"10.1145/3205289.3205324","DOIUrl":"https://doi.org/10.1145/3205289.3205324","url":null,"abstract":"Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility, performance, and power efficiency. However, FPGAs have not been widely used for high-performance computing, primarily because of their programming complexity and difficulties in optimizing performance. In this paper, we present a directive-based, high-level optimization framework for high-performance computing with FPGAs, built on top of an OpenACC-to-FPGA translation framework called OpenARC. We propose directive extensions and corresponding compile-time optimization techniques to enable the compiler to generate more efficient FPGA hardware configuration files. Empirical evaluation of the proposed framework on an Intel Stratix V with five OpenACC benchmarks from various application domains shows that FPGA-specific optimizations can lead to significant increases in performance across all tested applications. We also demonstrate that applying these high-level directive-based optimizations can allow OpenACC applications to perform similarly to lower-level OpenCL applications with hand-written FPGA-specific optimizations, and offer runtime and power performance benefits compared to CPUs and GPUs.","PeriodicalId":441217,"journal":{"name":"Proceedings of the 2018 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129052578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}