On the suitability of MPI as a PGAS runtime
J. Daily, Abhinav Vishnu, B. Palmer, H. V. Dam, D. Kerbyson
2014 21st International Conference on High Performance Computing (HiPC)
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116712
Partitioned Global Address Space (PGAS) models are emerging as a popular alternative to MPI models for designing scalable applications. At the same time, MPI remains a ubiquitous communication subsystem due to its standardization, high performance, and availability on leading platforms. In this paper, we explore the suitability of using MPI as a scalable PGAS communication subsystem. We focus on Remote Memory Access (RMA) communication in PGAS models, which typically includes get, put, and atomic memory operations. We perform an in-depth exploration of design alternatives based on MPI, including a semantically matching interface such as MPI-RMA, as well as less intuitive interfaces such as MPI two-sided combined with multi-threading and dynamic process management. Guided by this exploration and the shortcomings it reveals, we propose a novel design facilitated by the data-centric view in PGAS models. This design leverages a combination of highly tuned MPI two-sided semantics and an automatic, user-transparent split of MPI communicators to provide asynchronous progress. We implement this asynchronous progress ranks (PR) approach, along with the other alternatives, within the Communication Runtime for Exascale, the communication subsystem for Global Arrays. Our performance evaluation spans pure communication benchmarks, graph community detection and sparse matrix-vector multiplication kernels, and a computational chemistry application. The utility of the proposed PR-based approach is demonstrated by a 2.17x speedup on 1008 processors over the other MPI-based designs.
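The core of the PR design is a user-transparent partition of the MPI ranks: on each node, a few ranks are reserved to service one-sided get/put/accumulate requests using two-sided MPI calls, so the remaining compute ranks see asynchronous progress. The sketch below illustrates only the rank-partitioning step, in plain Python rather than MPI; the node-packed rank layout and the function name are assumptions, not the paper's implementation.

```python
def split_ranks(world_size, ranks_per_node, progress_per_node=1):
    """Partition world ranks into compute ranks and progress ranks (PRs).

    Assumes ranks are packed node by node; the last `progress_per_node`
    ranks on each node are set aside as PRs. In a real MPI program the
    compute list would seed an MPI_Comm_split color, transparently to
    the application.
    """
    compute, progress = [], []
    for rank in range(world_size):
        if rank % ranks_per_node >= ranks_per_node - progress_per_node:
            progress.append(rank)  # services RMA requests for its node
        else:
            compute.append(rank)   # runs the application
    return compute, progress
```

For example, with 8 ranks packed 4 per node, one rank per node (3 and 7) becomes a progress rank and the rest remain compute ranks.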
Combining HoL-blocking avoidance and differentiated services in high-speed interconnects
P. Yébenes, J. Escudero-Sahuquillo, Crispín Gómez Requena, P. García, F. J. Alfaro, F. Quiles, J. Duato
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116874
Current high-performance platforms such as datacenters or High-Performance Computing systems rely on high-speed interconnection networks able to cope with the ever-increasing communication requirements of modern applications. In particular, in high-performance systems that must offer differentiated services to applications involving traffic prioritization, it is almost mandatory that the interconnection network provide some type of Quality-of-Service (QoS) and congestion-management mechanism in order to achieve the required network performance. Most current QoS and congestion-management mechanisms for high-speed interconnects use the same kinds of resources, but with different criteria, resulting in disjoint mechanisms. By contrast, we propose in this paper a novel, straightforward solution that leverages the resources already available in InfiniBand components (basically Service Levels and Virtual Lanes) to provide both QoS and congestion management at the same time. This proposal is called CHADS (Combined HoL-blocking Avoidance and Differentiated Services), and it can be applied to any network topology. From the results shown in this paper for networks configured with the novel, cost-efficient KNS hybrid topology, we conclude that CHADS is more efficient than other schemes at reducing interference among packet flows that have the same or different priorities.
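The abstract's key idea, sharing one pool of Virtual Lanes between QoS and congestion management, can be pictured as carving the VLs into one group per priority and then spreading flows within a group so they do not queue behind each other. The toy mapping below is only an illustration of that resource split; real InfiniBand SL-to-VL tables are per-port and configured by the subnet manager, and the hashing scheme here is an assumption.

```python
def select_vl(service_level, destination, vls_per_priority):
    """Illustrative combined QoS + HoL-blocking-avoidance VL selection.

    The VL space is partitioned into contiguous groups, one group per
    priority (service level), giving differentiated service; inside a
    group, the destination hashes to a VL so flows toward different
    hotspots occupy different queues, limiting head-of-line blocking.
    """
    group_base = service_level * vls_per_priority  # QoS partition
    return group_base + destination % vls_per_priority  # HoL spreading
```

With 4 VLs per priority, two flows of the same priority toward different destinations land on different VLs, while a higher-priority flow always uses a disjoint VL group.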
An improved recursive graph bipartitioning algorithm for well balanced domain decomposition
Astrid Casadei, P. Ramet, J. Roman
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116878
In the context of hybrid sparse linear solvers based on domain decomposition and Schur complement approaches, obtaining a decomposition that balances both the internal node set size and the interface node set size across all domains is critical for load balancing and efficiency in a parallel computation. For this purpose, we revisit the original algorithm introduced by Lipton, Rose and Tarjan [1] in 1979, which performs the recursion for nested dissection in a particular manner. Starting from this specific recursive strategy, we propose several variations of the existing algorithms in the multilevel Scotch partitioner that take these multiple criteria into account, and we illustrate the improved results on a collection of graphs corresponding to finite element meshes used in numerical scientific applications.
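To make the nested-dissection recursion concrete, the sketch below runs it on the simplest possible graph, a path, where the median vertex is an exact separator: each split yields two equal-sized domains and a one-vertex interface, i.e. both balance criteria at once. This is purely illustrative; real partitioners like Scotch compute separators on general graphs with multilevel heuristics.

```python
def nested_dissection(lo, hi, min_size=2):
    """Toy nested dissection on a path graph with vertices lo..hi-1.

    Splits on the median vertex (the separator) and recurses on the
    two halves, mirroring the Lipton-Rose-Tarjan-style recursion the
    paper revisits. Leaves of the tree are the final domains; internal
    nodes hold the interface (separator) vertices.
    """
    n = hi - lo
    if n <= min_size:
        return {"domain": list(range(lo, hi))}
    mid = lo + n // 2
    return {
        "separator": [mid],
        "left": nested_dissection(lo, mid, min_size),
        "right": nested_dissection(mid + 1, hi, min_size),
    }
```

On a 7-vertex path this yields separator {3} at the top and separators {1} and {5} below it, with every leaf domain of size at most 2: both the internal and interface node sets stay balanced at each level.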
GPU parallelization of the stochastic on-time arrival problem
Maleen Abeydeera, S. Samaranayake
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116896
The Stochastic On-Time Arrival (SOTA) problem has recently been studied as an alternative to traditional shortest-path formulations in situations with hard deadlines. The goal is to find a routing strategy that maximizes the probability of reaching the destination within a pre-specified time budget, where the edge weights of the graph are random variables with arbitrary distributions. While this is a practically useful formulation for vehicle routing, commercial deployment of such methods is not currently feasible due to the high computational complexity of existing solutions. We present a parallelization strategy, implemented in CUDA, that improves computation times by multiple orders of magnitude over single-threaded CPU implementations. One order of magnitude is achieved via naive parallelization of the problem, and another via optimal utilization of the GPU resources. We also show that the runtime can be further reduced in certain cases using dynamic thread assignment and an edge-clustering method that accelerates queries with a small time budget.
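The sequential computation being parallelized is a dynamic program over (node, remaining-budget) pairs: the success probability from a node is the best, over outgoing edges, of the travel-time distribution convolved with the successor's success probabilities. The discrete-time sketch below shows that DP; the graph encoding (adjacency lists of per-edge probability mass functions) is an assumption for illustration, not the paper's data layout.

```python
def sota(graph, dest, budget):
    """Discrete-time Stochastic On-Time Arrival dynamic program.

    graph[v] is a list of (neighbor, pmf) pairs, where pmf[tau] is the
    probability that the edge takes tau time steps (pmf[0] unused).
    Returns u[v][t]: the probability of reaching `dest` from v within
    t steps under the optimal (budget-adaptive) routing policy.
    """
    u = {v: [0.0] * (budget + 1) for v in graph}
    for t in range(budget + 1):
        u[dest][t] = 1.0  # already at the destination
    for t in range(1, budget + 1):
        for v in graph:
            if v == dest:
                continue
            best = 0.0
            for w, pmf in graph[v]:
                # Convolve the edge's travel-time pmf with the
                # successor's success probabilities.
                p = sum(pmf[tau] * u[w][t - tau]
                        for tau in range(1, min(t, len(pmf) - 1) + 1))
                best = max(best, p)
            u[v][t] = best  # take the best outgoing edge
    return u
```

The inner convolution for each (node, budget) pair is independent of the others at the same time step, which is what makes the problem amenable to the massive per-thread parallelism described in the abstract.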
DRIVE: Using implicit caching hints to achieve disk I/O reduction in virtualized environments
Sujesha Sudevalayam, Purushottam Kulkarni
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116877
Co-hosting of virtualized applications results in similar content across multiple blocks on disk, which are fetched into memory (the host's page cache). Content similarity can be harnessed both to avoid duplicate disk I/O requests that fetch the same content repeatedly and to prevent multiple occurrences of duplicate content in the cache. Typically, caches store the most recently or frequently accessed blocks to reduce the number of disk read accesses. These caches are referenced by block number and cannot recognize content similarity across multiple blocks. Existing work in memory deduplication merges cache pages only after multiple identical blocks have already been fetched from disk into the cache, while existing work in I/O deduplication reserves a portion of the host cache as a content-aware cache. We propose a disk I/O reduction system for virtualized environments that addresses the dual problems of duplicate I/O and duplicate content in the host cache, without being invasive. We build a disk read-access optimization called DRIVE that identifies content similarity across multiple blocks and performs hint-based read I/O redirection to improve cache effectiveness, further reducing the number of disk reads. A metadata store is maintained based on the virtual machine's disk accesses, and implicit caching hints are collected for future read I/O redirection. The read I/O redirection is performed from within the virtual block device of the virtualized system, implicitly managing the entire host cache as a content-deduplicated cache. Our trace-based evaluation using a custom simulator reveals that DRIVE always performs at least as well as the vanilla system, achieving up to 20% better cache-hit ratios and reducing the number of disk reads by up to 80%. The results also indicate that our system achieves up to 97% content deduplication in the host cache.
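The mechanism described above can be sketched as two small structures: a metadata store mapping block numbers to content hashes (the implicit hints) and an LRU cache keyed by content hash rather than block number, so a read is redirected to an already-cached copy of identical content. This is a minimal stand-in for DRIVE, not its implementation; it ignores writes, stale hints, and the virtual-block-device plumbing.

```python
import hashlib
from collections import OrderedDict

class ContentDedupCache:
    """Sketch of hint-based read redirection over a content-keyed cache."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                 # block number -> bytes
        self.block_to_hash = {}          # metadata store: implicit hints
        self.cache = OrderedDict()       # content hash -> bytes (LRU)
        self.disk_reads = 0

    def read(self, block):
        h = self.block_to_hash.get(block)
        if h is not None and h in self.cache:
            self.cache.move_to_end(h)    # redirected hit: no disk I/O
            return self.cache[h]
        data = self.disk[block]          # miss: fetch from disk
        self.disk_reads += 1
        h = hashlib.sha1(data).hexdigest()
        self.block_to_hash[block] = h    # record hint for future reads
        self.cache[h] = data             # identical content stored once
        self.cache.move_to_end(h)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict LRU entry
        return data
```

Note that the first read of each distinct block number still goes to disk (no hint exists yet), but identical content occupies a single cache slot, and every later read of any block with a matching hint is served from that one copy.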
Software based ultrasound B-mode/beamforming optimization on GPU and its performance prediction
T. Phuong, Jeong-Gun Lee
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116911
In this paper, we design and optimize an ultrasound B-mode imaging pipeline, including its computationally demanding beamformer, on a commercial GPU. For performance optimization, we explore the design space spanned by the use of different memory types, instruction scheduling, thread mapping strategies, and so on. Then, with the developed B-mode imaging code, we conduct performance evaluations on various GPUs with different architectural features (e.g., the number of cores and core frequency). Through experiments on these devices, we identify "performance-significant factors": the hardware features affecting B-mode imaging performance. We then derive the analytical relationship between these GPU architectural design factors and B-mode imaging performance for our target application. From a commercial product-development perspective, the prediction model lets us select the GPU architectures best suited for ultrasound applications. In the future, such predictions could also be used to customize a "cost-minimal" GPU architecture that satisfies a given performance constraint. In addition, the prediction model can be used to dynamically control the activity of GPU components according to the temporal requirements on performance and power/energy consumption in portable ultrasound diagnosis systems.
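The beamforming kernel at the heart of a B-mode pipeline is typically delay-and-sum: for each image pixel, pick from every transducer channel the sample whose arrival time matches the round trip to that pixel, and sum. The sketch below shows that per-pixel kernel in plain Python; the plane-wave transmit geometry, sound speed, and sampling rate are illustrative assumptions, not the paper's configuration.

```python
import math

def delay_and_sum(rf, element_x, focus_x, focus_z, c=1540.0, fs=40e6):
    """Delay-and-sum receive beamforming for one focal point.

    rf[i] is the sampled echo signal of transducer element i, and
    element_x[i] its lateral position in metres. For the focal point
    (focus_x, focus_z), each channel contributes the sample at its
    round-trip-corrected arrival time. On a GPU, one thread would
    compute one such pixel.
    """
    out = 0.0
    for i, channel in enumerate(rf):
        # Receive path: distance from the focal point back to element i.
        d_rx = math.hypot(focus_x - element_x[i], focus_z)
        # Transmit path (plane wave straight down): depth only.
        delay = (focus_z + d_rx) / c
        idx = int(round(delay * fs))
        if 0 <= idx < len(channel):
            out += channel[idx]
    return out
```

Because every pixel reads many channels at scattered indices, the choice of GPU memory type for `rf` (texture, shared, or global) and the thread-to-pixel mapping dominate performance, which is exactly the design space the paper explores.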
A multilevel compressed sparse row format for efficient sparse computations on multicore processors
H. Kabir, J. Booth, P. Raghavan
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116882
We seek to improve the performance of sparse matrix computations on multicore processors with non-uniform memory access (NUMA). Typical implementations use a bandwidth-reducing ordering of the matrix to increase locality of accesses, with a compressed storage format to store and operate only on the non-zero values. We propose a new multilevel storage format and a companion ordering scheme as an explicit adaptation to NUMA hierarchies. More specifically, we propose CSR-k, a multilevel form of the popular compressed sparse row (CSR) format for a multicore processor with k > 1 well-differentiated levels in the memory subsystem. Additionally, we develop Band-k, a modified form of a traditional bandwidth-reduction scheme, to convert a matrix represented in CSR to our proposed CSR-k. We evaluate the performance of the widely used and important sparse matrix-vector multiplication (SpMV) kernel using CSR-2 on Intel Westmere processors for a test suite of 12 large sparse matrices with row densities in the range 3 to 45. On 32 cores, averaged across all matrices in the test suite, the execution time for SpMV with CSR-2 is less than 42% of the time taken by the state-of-the-art automatically tuned SpMV, resulting in energy savings of approximately 56%. Additionally, on average, the parallel speed-up on 32 cores relative to 1-core performance is 8.18 for the automatically tuned SpMV, compared to 19.71 for CSR-2. Our analysis indicates that the higher performance of SpMV with CSR-2 comes from achieving higher reuse of x in the shared L3 cache without incurring overheads from fill-in of original zeroes. Furthermore, the pre-processing costs of SpMV with CSR-2 can be amortized on average over 97 iterations of SpMV using CSR, substantially fewer than the 513 iterations required for the automatically tuned implementation. Based on these results, CSR-k appears to be a promising multilevel formulation of CSR for adapting sparse computations to multicore processors with NUMA memory hierarchies.
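For readers unfamiliar with the baseline, the flat CSR SpMV kernel that CSR-2 extends looks as follows; CSR-k layers additional index arrays on top of this, grouping rows into super-rows so that rows sharing entries of x are processed by nearby cores. The loop below is the standard kernel, not the paper's CSR-2 code.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A @ x with A in CSR format.

    row_ptr[i]..row_ptr[i+1] delimit row i's non-zeros in the parallel
    arrays col_idx (column indices) and vals (values). Reuse of x in
    cache depends on how rows and their column patterns are ordered,
    which is precisely what CSR-k's extra levels optimize.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the 2x2 matrix [[1, 2], [0, 3]], the CSR arrays are row_ptr = [0, 2, 3], col_idx = [0, 1, 1], vals = [1, 2, 3].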
TriKon: A hypervisor aware manycore processor
Rohan Bhalla, Prathmesh Kallurkar, Nitin Gupta, S. Sarangi
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116710
Virtualization is increasingly being deployed to run applications in cloud computing environments. Sadly, there are overheads associated with hypervisors that can prohibitively reduce application performance. A major source of these overheads is the destructive interference between the application, OS, and hypervisor in the memory system. We characterize such overheads in this paper, and propose the design of a novel Triangle cache that can effectively mitigate destructive interference across these three classes of workloads. We subsequently design the TriKon manycore processor, which consists of a set of heterogeneous cores with caches of different sizes, including Triangle caches. To maximize the throughput of the system as a whole, we propose a dynamic algorithm for scheduling a mix of system-intensive and CPU-intensive applications on the heterogeneous cores. The area of the TriKon processor is within 2% of a baseline processor, and with such a system we achieve a performance gain of 12% for a suite of benchmarks. Within this suite, the system-intensive benchmarks show a performance gain of 20% while the performance of the compute-intensive ones remains unaffected. Furthermore, by allocating extra area to cores with sophisticated cache designs, we improve the performance gain of the system-intensive benchmarks to 30%.
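A heterogeneity-aware scheduler of the kind the abstract describes can be caricatured as: rank applications by how system-intensive they are, then steer the most system-intensive ones to the cores with the richer cache design. The greedy sketch below is a hypothetical stand-in; the paper's scheduler is dynamic and measures workload behavior online, and the metric and names here are assumptions.

```python
def schedule(apps, big_cache_cores, small_cache_cores):
    """Greedy assignment of apps to heterogeneous cores.

    apps is a list of (name, os_time_fraction) pairs. Applications
    spending a larger fraction of time in the OS/hypervisor (system
    intensive) benefit most from the larger caches, so they claim
    the big-cache cores first. Assumes enough cores for all apps.
    """
    ranked = sorted(apps, key=lambda a: a[1], reverse=True)
    pools = [list(big_cache_cores), list(small_cache_cores)]
    plan = {}
    for name, _ in ranked:
        pool = pools[0] if pools[0] else pools[1]
        plan[name] = pool.pop(0)
    return plan
```

With one core of each kind, a database-like workload spending 60% of its time in system code would take the big-cache core, leaving the compute-bound matrix kernel on the ordinary core.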
An early experience of regional ocean modelling on Intel Many Integrated Core architecture
Srikanth Yalavarthi, A. Kaginalkar
Pub Date: 2014-12-01  DOI: 10.1109/HiPC.2014.7116907
Ocean modelling is an inherently complex undertaking within the earth system framework that poses a challenge to earth and computational scientists. Simulating wide temporal and spatial scales in real time or near real time requires computational scientists to explore new performance-enhancing architectures and simulation methods. Owing to the spectrum of scales of motion, the computational requirements of ocean forecasting are higher than those of numerical weather prediction. To some extent, rapidly evolving computer technology provides solutions to this. In this paper, we present initial attempts at porting a high-resolution regional ocean model to an Intel MIC-based hybrid system. We discuss the challenges and issues to be addressed in achieving an efficient implementation of an ocean modelling system.
Pub Date : 2014-12-01DOI: 10.1109/HiPC.2014.7116897
Geetika Malhotra, Seep Goel, S. Sarangi
In this paper, we introduce GpuTejas, a new Java-based parallel GPGPU simulator. GpuTejas is a fast trace-driven simulator that uses relaxed synchronization and non-blocking data structures to derive its speedups. Second, it introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We demonstrate a mean speedup of 17.33x with 64 threads over sequential execution, and a speedup of 429x over the widely used simulator GPGPU-Sim. We validated our timing and simulation model by comparing our results against a native system (NVIDIA Tesla M2070). Compared to the sequential version of GpuTejas, the parallel version has an error limited to less than 7.67% for our suite of benchmarks, which is similar to the numbers reported by competing parallel simulators.
{"title":"GpuTejas: A parallel simulator for GPU architectures","authors":"Geetika Malhotra, Seep Goel, S. Sarangi","doi":"10.1109/HiPC.2014.7116897","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116897","url":null,"abstract":"In this paper, we introduce a new Java-based parallel GPGPU simulator, GpuTejas. GpuTejas is a fast trace driven simulator, which uses relaxed synchronization, and non-blocking data structures to derive its speedups. Secondly, it introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We demonstrate a mean speedup of 17.33x with 64 threads over sequential execution, and a speedup of 429X over the widely used simulator GPGPU-Sim. We validated our timing and simulation model by comparing our results with a native system (NVIDIA Tesla M2070). As compared to the sequential version of GpuTejas, the parallel version has an error limited to <;7.67% for our suite of benchmarks, which is similar to the numbers reported by competing parallel simulators.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"74 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127181559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}