"Dynamic percolation: a case of study on the shortcomings of traditional optimization in many-core architectures" (doi:10.1145/2212908.2212944)
E. Garcia, Daniel A. Orozco, R. Khan, Ioannis E. Venetis, Kelly Livingston, G. Gao
This paper discusses the shortcomings of traditional static optimization techniques when they are used in the context of many-core architectures. We argue that these shortcomings result from the significantly different environment found in many-cores. We analyze previous attempts to optimize Dense Matrix Multiplication (DMM) that failed to achieve high performance despite extensive optimization effort. We found that percolation (prefetching data) and scheduling play a central role in application performance. To overcome these difficulties, we (1) fused dynamic scheduling and percolation into a dynamic percolation approach and (2) added further percolation operations. These new techniques raised the performance of the application in our study from 44 GFLOPS (out of a possible 80 GFLOPS) to 70.0 GFLOPS with operands in SRAM, or 65.6 GFLOPS with operands in DRAM.
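To make the idea concrete, here is a minimal sketch of dynamic percolation, assuming a queue-based runtime: idle workers pull tile tasks from a shared queue and copy ("percolate") each tile's operands into contiguous local buffers just before computing on them, so prefetching is fused with dynamic scheduling instead of being fixed by a static schedule. The tile size and the NumPy buffers standing in for on-chip SRAM are illustrative assumptions, not the paper's Cyclops-64 implementation.

```python
# Minimal sketch: dynamic scheduling fused with percolation for tiled DMM.
import queue
import threading
import numpy as np

TILE = 64
A = np.random.rand(4 * TILE, 4 * TILE)
B = np.random.rand(4 * TILE, 4 * TILE)
C = np.zeros_like(A)

tasks = queue.Queue()
for i in range(0, A.shape[0], TILE):
    for j in range(0, B.shape[1], TILE):
        tasks.put((i, j))

def worker():
    while True:
        try:
            i, j = tasks.get_nowait()        # dynamic scheduling: pull work when idle
        except queue.Empty:
            return
        a = np.ascontiguousarray(A[i:i + TILE, :])  # "percolate" the operands into
        b = np.ascontiguousarray(B[:, j:j + TILE])  # contiguous local buffers first
        C[i:i + TILE, j:j + TILE] = a @ b           # then compute on the local copies

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert np.allclose(C, A @ B)
```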
"Studying the impact of application-level optimizations on the power consumption of multi-core architectures" (doi:10.1145/2212908.2212927)
S. Rahman, Jichi Guo, Akshatha Bhat, Carlos D. Garcia, Majedul Haque Sujon, Qing Yi, C. Liao, D. Quinlan
This paper studies the overall system power variation of two multi-core machines, an 8-core Intel and a 32-core AMD workstation, while executing a wide variety of sequential and multi-threaded benchmarks under varying compiler optimization settings and runtime configurations. Our extensive experimental study provides insights into two questions: 1) to what degree application-level optimizations can reduce the overall system power consumption of modern CMP architectures; and 2) what strategies compilers and application developers can adopt to balance performance and power efficiency for applications from a variety of scientific and embedded-systems domains.
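As a hedged illustration of the measurement side, the sketch below compares the energy of two builds of one benchmark by sampling the Linux powercap RAPL counter around each run. This is a stand-in, not the paper's methodology: the study measured whole-system power on its 2012 testbeds, and the `./bench_O0` and `./bench_O3` binaries are hypothetical builds compiled at different optimization levels.

```python
# Compare package energy across compiler optimization settings via RAPL sysfs
# (Intel/Linux only; reading energy_uj may require root on recent kernels).
import subprocess

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package-0 energy counter

def read_uj():
    with open(RAPL) as f:
        return int(f.read())

def energy_joules(cmd):
    before = read_uj()
    subprocess.run(cmd, check=True)          # run the benchmark to completion
    after = read_uj()
    return (after - before) / 1e6            # microjoules -> joules (ignores wraparound)

for binary in ["./bench_O0", "./bench_O3"]:  # hypothetical -O0 and -O3 builds
    print(binary, f"{energy_joules([binary]):.1f} J")
```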
"CRESTA: a software focussed approach to exascale co-design" (doi:10.1145/2212908.2212958)
Mark I. Parsons
The CRESTA project is one of three complementary exascale software projects funded by the European Commission. The three-year project employs a novel approach to exascale system co-design that focuses on a small, representative set of applications to inform and guide software and systemware development. This methodology is designed to identify where problem areas exist in applications and to use that knowledge to evaluate alternative solutions, which in turn inform software and hardware advances. CRESTA pursues both incremental and disruptive advances to move towards solutions across the whole of the exascale software stack.
"DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories" (doi:10.1145/2212908.2212925)
Nikola Vujic, Lluc Alvarez, Marc González, X. Martorell, E. Ayguadé
This paper presents DMA-circular, a novel DMA controller for optimized management of on-chip local memories. DMA-circular embeds the functionality of caches into the DMA controller and applies aggressive optimizations using novel hardware. It anticipates the computation's data-transfer requirements and manages buffers for the data mapped to the local memory. The explicit hardware support accelerates the most common actions in managing a local memory, while the cache functionality gives DMA-circular a high level of programmability. The evaluation covers several high-performance kernels from the NAS benchmark suite. Compared to traditional DMA controllers, results show speedups from 1.20x to 2x, keep the control-code overhead under 15% of the kernels' execution time, and reduce energy consumption by up to 40%.
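The following sketch reproduces in software the overlap that DMA-circular automates in hardware: a circular ring of local buffers in which the transfer of tile k+1 is issued while tile k is being processed. The thread-pool "DMA engine" and the two-slot ring are illustrative assumptions, not the proposed controller.

```python
# Software analogue of circular double buffering: prefetch overlaps compute.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.random.rand(16, 1 << 16)   # 16 tiles sitting in "main memory"
ring = [None, None]                  # two-slot circular buffer ("local memory")
dma = ThreadPoolExecutor(max_workers=1)

def fetch(k):                        # stands in for an asynchronous DMA get
    return np.copy(data[k])

total = 0.0
pending = dma.submit(fetch, 0)       # issue the transfer of the first tile
for k in range(len(data)):
    ring[k % 2] = pending.result()   # wait until tile k has landed in its slot
    if k + 1 < len(data):
        pending = dma.submit(fetch, k + 1)   # prefetch tile k+1 into the other slot
    total += float(ring[k % 2].sum())        # compute on tile k overlaps the prefetch

assert np.isclose(total, data.sum())
```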
"Towards player-driven procedural content generation" (doi:10.1145/2212908.2212942)
Noor Shaker, Georgios N. Yannakakis, J. Togelius
Generating immersive game content is one of the ultimate goals of a game designer. Achieving it requires recognizing that players' perception of the same game differs according to a number of factors, including personality, playing style, expertise, and cultural background. While one player might find a game immersive, another may quit after encountering a seemingly insoluble problem. One promising avenue towards optimizing the gameplay experience for individual players is to tailor the experience in real time via automatic game content generation. Specifying the aspects of the game that most influence the gameplay experience, identifying the relationship between these aspects and each individual's experience, and defining a mechanism for tailoring the game content to each individual's needs are important steps towards player-driven content generation.
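A minimal sketch of that adaptation loop, under stated assumptions: fit a model from (player features, content parameters) to an engagement score using logged play sessions, then search candidate content parameters for the value the model predicts the current player will enjoy most. The features (skill, gap width, enemy density), the synthetic session data, and the least-squares model are all hypothetical.

```python
# Player-driven content selection: learn engagement, then optimize content.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def features(skill, gap, enemies):
    # basis chosen so a linear fit can capture the skill/difficulty interaction
    return np.array([gap * gap, skill * gap, skill * skill,
                     enemies * enemies, enemies, 1.0])

# synthetic "logged sessions": engagement peaks when challenge (gap width)
# matches skill and enemy density is moderate
skill, gap, enemies = rng.random(300), rng.random(300), rng.random(300)
y = -(2 * gap - skill) ** 2 - (enemies - 0.5) ** 2 \
    + 0.05 * rng.standard_normal(300)
X = np.stack([features(s, g, e) for s, g, e in zip(skill, gap, enemies)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)        # learn the engagement model

def best_level(player_skill, candidates):
    # search the content space for what this player is predicted to enjoy most
    return max(candidates, key=lambda c: features(player_skill, *c) @ w)

grid = list(product(np.linspace(0, 1, 21), repeat=2))   # (gap, enemies) candidates
print("novice:", best_level(0.2, grid))   # expect gap near 0.10, enemies near 0.5
print("expert:", best_level(0.9, grid))   # expect gap near 0.45, enemies near 0.5
```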
"Towards more intelligent adaptive video game agents: a computational intelligence perspective" (doi:10.1145/2212908.2212955)
S. Lucas, Philipp Rohlfshagen, Diego Perez Liebana
This paper provides a computational intelligence perspective on the design of intelligent video game agents. The paper explains why this is an interesting area to research, and outlines the most promising approaches to date, including evolution, temporal difference learning and Monte Carlo Tree Search. Strengths and weaknesses of each approach are identified, and some research directions are outlined that may soon lead to significantly improved video game agents with lower development costs.
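Of the approaches surveyed, Monte Carlo Tree Search is the easiest to illustrate compactly. Below is a minimal UCT sketch applied to the toy game of Nim (take 1-3 stones; whoever takes the last stone wins); a real game agent would add domain heuristics, an informed simulation policy, and a time budget.

```python
# Minimal UCT (MCTS) for Nim. Node statistics are kept from the perspective
# of the player who just moved into the node.
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones, self.parent, self.move = stones, parent, move
        self.children, self.wins, self.visits = [], 0, 0
        self.untried = [m for m in (1, 2, 3) if m <= stones]

def uct_best_move(stones, iters=3000):
    root = Node(stones)
    for _ in range(iters):
        node = root
        # 1. selection: descend through fully expanded nodes via UCB1
        while not node.untried and node.children:
            node = max(node.children, key=lambda c: c.wins / c.visits
                       + math.sqrt(2 * math.log(node.visits) / c.visits))
        # 2. expansion: add one randomly chosen untried child
        if node.untried:
            m = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(node.stones - m, parent=node, move=m)
            node.children.append(child)
            node = child
        # 3. simulation: random playout, tracking who takes the last stone
        s, last_mover_is_node_player = node.stones, True
        while s > 0:
            s -= random.choice([m for m in (1, 2, 3) if m <= s])
            last_mover_is_node_player = not last_mover_is_node_player
        result = 1 if last_mover_is_node_player else 0
        # 4. backpropagation: flip the perspective at each level up the tree
        while node is not None:
            node.visits += 1
            node.wins += result
            result = 1 - result
            node = node.parent
    return max(root.children, key=lambda c: c.visits).move

print(uct_best_move(10))   # should usually settle on taking 2 (leaving 8)
```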
"SuperCoP: a general, correct, and performance-efficient supervised memory system" (doi:10.1145/2212908.2212922)
Bharghava Rajaram, V. Nagarajan, Andrew J. McPherson, Marcelo H. Cintra
Supervised memory systems maintain additional metadata for each memory address accessed by the program, in order to control and monitor accesses to program data. Supervised systems find use in several applications, including memory checking, synchronization, race detection, and transactional memory. Conventional memory instructions are replaced by supervised memory instructions (SMIs), which operate on both data and metadata atomically. Existing proposals for supervised memory systems assume sequential consistency. Recently, Bobba et al. [4] demonstrated the correctness issues (imprecise exceptions and metadata read reordering) that arise from naively applying supervision to Total-Store-Order, and proposed two solutions, TSOall and TSOdata, to overcome them. TSOall ensures correctness by forcing SMIs to perform in order, but consequently performs like SC, since supervised writes cannot retire into the write buffer. TSOdata allows supervised writes to retire into the write buffer, but works correctly for only a subset of supervision schemes. In this paper we observe that correctness is ensured as long as SMIs read and process their metadata in order. We propose SuperCoP, a supervised memory system for relaxed memory models in which SMIs read and process metadata before retirement, while data and metadata writes are allowed to retire into the write buffer. Since SuperCoP separates metadata reads and their processing from the writes, we propose a simple mechanism, cache-block-level locking at the directory, to ensure atomicity. Our experimental results show that SuperCoP performs 16.8% better than TSOall. SuperCoP also performs 6% better than TSOdata, even though TSOdata is not general.
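The key invariant, that metadata is read and processed in program order even while data writes drain lazily through the write buffer, can be shown with a toy software model. The "initialized-bit" supervision scheme and the explicit FIFO write buffer below are illustrative assumptions for exposition, not the paper's hardware design.

```python
# Toy model: in-order metadata checks with lazily drained data writes.
from collections import deque

memory, meta = {}, {}        # data memory and per-address metadata (init bit)
write_buffer = deque()       # retired stores waiting to drain to memory

def supervised_store(addr, val):
    meta[addr] = True                 # metadata processed in program order...
    write_buffer.append((addr, val))  # ...while the data write is buffered

def supervised_load(addr):
    if not meta.get(addr):            # in-order metadata check -> precise exception
        raise RuntimeError(f"precise exception: uninitialized load at {addr:#x}")
    for a, v in reversed(write_buffer):   # forward from the write buffer if needed
        if a == addr:
            return v
    return memory[addr]

def drain_one():                      # memory system drains the oldest write
    if write_buffer:
        addr, val = write_buffer.popleft()
        memory[addr] = val

supervised_store(0x10, 42)
print(supervised_load(0x10))          # 42, forwarded from the write buffer
drain_one()
try:
    supervised_load(0x20)             # bug caught in order despite buffering
except RuntimeError as e:
    print(e)
```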
"Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies" (doi:10.1145/2212908.2212923)
D. Yoon, T. Gonzalez, Parthasarathy Ranganathan, R. Schreiber
To meet the demand for very large main memories, nonvolatile memory (NVM) is likely to be used as main memory. NVM main memory will have higher latency than DRAM. To cope with this, we advocate a shallower cache hierarchy based on a large last-level NVM cache. We develop a model that estimates the average memory access time and power of a cache hierarchy. The model is based on captured application behavior, an analytical power and performance model, and circuit-level memory models such as CACTI and NVSim. We use the model to explore the cache hierarchy design space and present latency-power tradeoffs for memory-intensive SPEC benchmarks and scientific applications. The results indicate that a flattened hierarchy lowers power and improves average memory access time.
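The core of such a model is the classic recursive estimate AMAT_i = hit_time_i + miss_rate_i * AMAT_{i+1}. The sketch below applies it to a deep DRAM-backed hierarchy and to a flattened hierarchy with a large last-level NVM cache; every latency and miss rate is a hypothetical round number, not a CACTI or NVSim output.

```python
# Recursive average-memory-access-time (AMAT) comparison of two hierarchies.
def amat(levels, memory_latency):
    """levels: list of (hit_time_cycles, miss_rate) ordered from L1 outward."""
    t = memory_latency
    for hit_time, miss_rate in reversed(levels):
        t = hit_time + miss_rate * t
    return t

deep = [(4, 0.10), (12, 0.40), (40, 0.50)]   # L1 + L2 + small L3, DRAM behind
flat = [(4, 0.10), (30, 0.05)]               # L1 + very large last-level NVM cache

print("deep:", amat(deep, memory_latency=200), "cycles")   # -> 10.8 (DRAM memory)
print("flat:", amat(flat, memory_latency=400), "cycles")   # -> 9.0 (slower NVM memory)
# The huge NVM cache's low miss rate more than offsets the slower NVM behind it.
```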
"A reconfigurable optical/electrical interconnect architecture for large-scale clusters and datacenters" (doi:10.1145/2212908.2212913)
D. Lugones, K. Katrinis, M. Collier
Hybrid optical/electrical interconnects, which use commercially available optical circuit switches in the core of the network, have recently been proposed as an attractive alternative to fully-connected electronically-switched networks in terms of port density, bandwidth per port, cabling, and energy efficiency. Although the shift from a traditionally packet-switched core to circuit-granularity switching between server aggregations (or servers) requires system redesign, the approach has been shown to fit the traffic requirements of certain classes of high-performance computing applications, as well as the traffic patterns exhibited by typical data center workloads. Recent proposals for such systems have targeted small- to medium-scale hybrid interconnects. In this paper, we present a hybrid optical/electrical interconnect architecture intended for large-scale deployments of high-performance computing systems and server co-locations. To reduce complexity, our architecture employs a regular shuffle network topology that allows for simple management and cabling. Thanks to a single-stage core interconnect and multiple optical planes, our design can be both incrementally scaled up (in capacity) and scaled out (in the number of racks) without major re-cabling or network re-configuration. To our knowledge, we are also the first to explore the benefit of multi-hopping in the optical domain as a means of avoiding constant reconfiguration of optical circuit switches. We have prototyped our architecture at packet-level detail in a simulation framework to evaluate this concept. Our results demonstrate that our hybrid interconnect, by adapting to the changing nature of application traffic, can significantly exceed the throughput of a static interconnect of equal degree, while at times attaining throughput comparable to that of a costly fully-connected network. We also show a further benefit of multi-hopping: by reducing the frequency of reconfiguration, it mitigates the associated performance drops.
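The multi-hop idea can be sketched as routing over the currently configured circuits: if a short path through intermediate racks already exists, traffic is forwarded along it instead of paying the latency of reconfiguring the optical circuit switch. The circuit map and BFS routing below are illustrative, not the paper's packet-level simulator.

```python
# Multi-hop routing over the circuits currently configured between racks.
from collections import deque

circuits = {                 # current optical circuit topology (one plane)
    "rack0": ["rack1", "rack2"],
    "rack1": ["rack0", "rack3"],
    "rack2": ["rack0"],
    "rack3": ["rack1"],
}

def route(src, dst, max_hops=3):
    """BFS over configured circuits; None means a reconfiguration is needed."""
    frontier, seen = deque([[src]]), {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        if len(path) > max_hops:
            continue
        for nxt in circuits.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

print(route("rack2", "rack3"))   # ['rack2', 'rack0', 'rack1', 'rack3']: 3 optical hops
print(route("rack2", "rack9"))   # None -> set up a new circuit instead
```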
"Accelerated high-performance computing through efficient multi-process GPU resource sharing" (doi:10.1145/2212908.2212950)
Teng Li, Vikram K. Narayana, T. El-Ghazawi
The HPC field is witnessing widespread adoption of GPUs as accelerators for traditionally homogeneous HPC systems. One of the prevalent parallel programming models is the SPMD paradigm, which has been adapted for GPU-based parallel processing. Since each process executes the same program under SPMD, every process mapped to a CPU core also needs access to a GPU, so SPMD demands a symmetric CPU/GPU distribution. However, because modern HPC systems feature CPU cores that far outnumber GPUs, computing resources are generally underutilized under SPMD. Our previous efforts focused on GPU virtualization, which enables efficient sharing of a GPU among multiple CPU processes. Nevertheless, a formal method to evaluate and choose the appropriate GPU sharing approach has been lacking. In this paper, based on SPMD GPU kernel profiles, we propose different multi-process GPU sharing scenarios under virtualization. We introduce an analytical model that captures these sharing scenarios and provides a theoretical estimate of the performance gain. Benchmarks validate our analyses and the achievable performance gains. While the analytical study provides a suitable theoretical foundation for GPU sharing, the experimental results demonstrate that GPU virtualization delivers significant performance improvements over non-virtualized solutions in all proposed sharing scenarios.
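A back-of-the-envelope version of such an analysis, as an illustrative simplification rather than the paper's exact model: if each SPMD process alternates a CPU phase of length c with a GPU kernel of length g, then sharing one virtualized GPU among n processes overlaps one process's kernel with the others' CPU work, so the steady-state gain over letting each process own the GPU in turn is bounded by min(n, (c+g)/g).

```python
# Pipeline model of multi-process GPU sharing under virtualization.
def sharing_gain(c, g, n):
    serial = n * (c + g)         # processes take turns owning the whole GPU
    shared = max(c + g, n * g)   # overlapped: either the CPU side or the GPU
                                 # becomes the steady-state bottleneck
    return serial / shared       # equals min(n, (c + g) / g)

for n in (2, 4, 8):
    print(f"n={n}: CPU-heavy gain {sharing_gain(c=9, g=1, n=n):.2f}x, "
          f"GPU-heavy gain {sharing_gain(c=1, g=9, n=n):.2f}x")
```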