Premonition of storage response class using Skyline ranked Ensemble method
K. Dheenadayalan, V. Muralidhara, Pushpa Datla, G. Srinivasaraghavan, Maulik Shah
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116886
Tertiary storage areas are integral parts of a compute environment and are primarily used to store the vast amounts of data generated by scientific and industrial workloads. Modelling the likely usage patterns of a storage area helps administrators take preventive action and guide users on how to use storage areas that are trending towards a slow or unresponsive state. Treating the storage performance parameters as time-series data allows the values for the next `n' intervals to be predicted with forecasting models such as ARIMA. The predicted performance parameters are then used to classify whether the entire storage area, or a logical component of it, is tending towards unresponsiveness. Classification is performed using the proposed Skyline ranked Ensemble model with two possible classes: high response state and low response state. Heavy-load scenarios were simulated, and close to 95% of the observed behaviour was explained by the proposed model.
{"title":"Premonition of storage response class using Skyline ranked Ensemble method","authors":"K. Dheenadayalan, V. Muralidhara, Pushpa Datla, G. Srinivasaraghavan, Maulik Shah","doi":"10.1109/HiPC.2014.7116886","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116886","url":null,"abstract":"Tertiary storage areas are integral parts of compute environment and are primarily used to store vast amount of data that is generated from any scientific/industry workload. Modelling the possible pattern of usage of storage area helps the administrators to take preventive actions and guide users on how to use the storage areas which are tending towards slower to unresponsive state. Treating the storage performance parameters as a time series data helps to predict the possible values for the next `n' intervals using forecasting models like ARIMA. These predicted performance parameters are used to classify if the entire storage area or a logical component is tending towards unresponsiveness. Classification is performed using the proposed Skyline ranked Ensemble model with two possible classes, i.e. high response state and low response state. Heavy load scenarios were simulated and close to 95% of the behaviour were explained using the proposed model.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134375022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A flexible scheduling framework for heterogeneous CPU-GPU clusters
Kittisak Sajjapongse, T. Agarwal, M. Becchi
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116892
In the last few years, thanks to their computational power and progressively increased programmability, GPUs have become part of HPC clusters. As a result, widely used open-source cluster resource managers (e.g. TORQUE and SLURM) have recently been extended with GPU support. These systems, however, treat GPUs as dedicated resources and provide scheduling mechanisms that often result in resource underutilization and, thereby, suboptimal performance. We propose a cluster-level scheduler and integrate it with our previously proposed node-level GPU virtualization runtime [1, 2], providing a hierarchical cluster resource management framework that allows efficient use of heterogeneous CPU-GPU clusters. The scheduling policy used by our system is configurable, and our scheduler provides administrators with a high-level API that makes it easy to define custom scheduling policies. We provide two application- and hardware-heterogeneity-aware cluster-level scheduling schemes for hybrid MPI-CUDA applications: co-location-based and latency-reduction-based scheduling, and use them in combination with a preemption-based GPU sharing policy implemented at the node level. We validate our framework on two heterogeneous clusters, one consisting of commodity workstations and the other of high-end nodes with various hardware configurations, using a mix of communication- and compute-intensive applications. Our experiments show that, by better utilizing the available resources, our scheduling framework outperforms existing batch schedulers in terms of both throughput and application latency.
{"title":"A flexible scheduling framework for heterogeneous CPU-GPU clusters","authors":"Kittisak Sajjapongse, T. Agarwal, M. Becchi","doi":"10.1109/HiPC.2014.7116892","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116892","url":null,"abstract":"In the last few years, thanks to their computational power and progressively increased programmability, GPUs have become part of HPC clusters. As a result, widely used open-source cluster resource managers (e.g. TORQUE and SLURM) have recently been extended with GPU support capabilities. These systems, however, treat GPUs as dedicated resources and provide scheduling mechanisms that often result in resource underutilization and, thereby, in suboptimal performance. We propose a cluster-level scheduler and integrate it with our previously proposed node-level GPU virtualization runtime [1, 2], thus providing a hierarchical cluster resource management framework that allows the efficient use of heterogeneous CPU-GPU clusters. The scheduling policy used by our system is configurable, and our scheduler provides administrators with a high-level API that allows easily defining custom scheduling policies. We provide two application- and hardware-heterogeneity-aware cluster-level scheduling schemes for hybrid MPI-CUDA applications: co-location- and latency-reduction-based scheduling, and use them in combination with a preemption-based GPU sharing policy implemented at the node-level. We validate our framework on two heterogeneous clusters: one consisting of commodity workstations and the other of high-end nodes with various hardware configurations, and on a mix of communication- and compute-intensive applications. Our experiments show that, by better utilizing the available resources, our scheduling framework outperforms existing batch-schedulers both in terms of throughput and application latency.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116466397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Multi-dimensional query processing with data migration in distributed cache infrastructure
Youngmoon Eom, Jinwoong Kim, Deukyeon Hwang, J. Kwak, Minho Shin, Beomseok Nam
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116906
In distributed query processing systems where the caching infrastructure is distributed and scales with the number of servers, orchestrating and leveraging a large number of cached objects seamlessly is becoming more important, as the present trend is to build large scalable systems by connecting many small heterogeneous machines. With a large-scale distributed caching system, a scheduling policy must consider both the cache hit ratio and the system load balance to optimize multiple queries. A scheduling policy that considers system load but not cache hit ratio often fails to reuse cached data, as it may not assign a query to the server that holds the objects the query needs. Conversely, a scheduling policy that considers cache hit ratio but not system load may suffer from load imbalance. To maximize overall system throughput and reduce query response time, a multiple-query scheduling policy must balance system load while also leveraging cached objects. In this paper, we present a distributed query processing framework that achieves a high cache hit ratio while maintaining good system load balance. To manage our distributed, scalable caching system seamlessly, the framework performs autonomic cached-data migrations that further improve the cache hit ratio. Our experiments show that the proposed query scheduling and data migration policies significantly improve system throughput by achieving a high cache hit ratio while avoiding load imbalance.
{"title":"Improving Multi-dimensional query processing with data migration in distributed cache infrastructure","authors":"Youngmoon Eom, Jinwoong Kim, Deukyeon Hwang, J. Kwak, Minho Shin, Beomseok Nam","doi":"10.1109/HiPC.2014.7116906","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116906","url":null,"abstract":"In distributed query processing systems where caching infrastructure is distributed and scales with the number of servers, it is becoming more important to orchestrate and leverage a large number of cached objects in distributed caching systems seamlessly as the present trend is to build large scalable distributed systems by connecting small heterogeneous machines. With a large scale distributed caching system, a scheduling policy must consider both cache hit ratio and system load balance to optimize multiple queries. A scheduling policy that considers system load but not cache hit ratio often fails to reuse cached data by not assigning a query to the sever that has data objects the query needs. On the contrary, a scheduling policy that considers cache hit ratio but not system load balance may suffer from system load imbalance. To maximize the overall system throughput and to reduce query response time, a multiple query scheduling policy must balance system load and also leverage cached objects. In this paper, we present a distributed query processing framework that exhibits high cache hit ratio while achieving good system load balance. In order to seamlessly manage our distributed scalable caching system, our framework performs autonomic cached data migrations to improve cache hit ratio. Our experiments show that our proposed query scheduling policy and data migration policy significantly improve system throughput by achieving high cache hit ratio while avoiding system load imbalance.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127920387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coupling-aware graph partitioning algorithms: Preliminary study
Maria Predari, Aurélien Esnard
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116879
In the field of scientific computing, load balancing is a major issue that determines the performance of parallel applications. Nowadays, simulations of real-life problems are becoming more and more complex, involving numerous coupled codes that represent different models. In this context, reaching high performance can be a great challenge. In this paper, we present graph partitioning techniques, called co-partitioning, that address the problem of load balancing for two coupled codes: the key idea is to perform a “coupling-aware” partitioning instead of partitioning the codes independently, as is usually done. Finally, we present a preliminary experimental study comparing our methods against the usual approach.
{"title":"Coupling-aware graph partitioning algorithms: Preliminary study","authors":"Maria Predari, Aurélien Esnard","doi":"10.1109/HiPC.2014.7116879","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116879","url":null,"abstract":"In the field of scientific computing, load balancing is a major issue that determines the performance of parallel applications. Nowadays, simulations of real-life problems are becoming more and more complex, involving numerous coupled codes, representing different models. In this context, reaching high performance can be a great challenge. In this paper, we present graph partitioning techniques, called co-partitioning, that address the problem of load balancing for two coupled codes: the key idea is to perform a “coupling-aware” partitioning, instead of partitioning these codes independently, as it is usually done. Finally, we present a preliminary experimental study which compares our methods against the usual approach.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129407676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A proactive approach for coping with uncertain resource availabilities on desktop grids
Louis-Claude Canon, Adel Essafi, D. Trystram
Pub Date: 2014-05-29. DOI: 10.1109/HiPC.2014.7116890
Uncertainties stemming from multiple sources affect distributed systems and jeopardize their efficient utilization. Desktop grids are especially affected by this issue, as the volunteers lending their resources may behave irregularly and unpredictably. Efficiently exploiting the power of such systems raises theoretical issues that have received little attention in the literature. In this paper, we assume that predictions exist for the intervals during which machines are available. When the error of these predictions is bounded, it is possible to schedule a set of jobs such that the effective total execution time will not exceed the predicted one. We formally prove that this holds when jobs are scheduled only in large intervals and sufficient slack is provisioned to absorb the uncertainties. We also present multiple heuristics with various efficiencies and costs, which are empirically assessed through simulations based on actual traces.
{"title":"A proactive approach for coping with uncertain resource availabilities on desktop grids","authors":"Louis-Claude Canon, Adel Essafi, D. Trystram","doi":"10.1109/HiPC.2014.7116890","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116890","url":null,"abstract":"Uncertainties stemming from multiple sources affect distributed systems and jeopardize their efficient utilization. Desktop grids are especially concerned by this issue as volunteers lending their resources may have irregular and unpredictable behaviors. Efficiently exploiting the power of such systems raises theoretical issues that received little attention in the literature. In this paper, we assume that there exist predictions on the intervals during which machines are available. When these predictions have a limited estimation, it is possible to schedule a set of jobs such that the effective total execution time will not be higher than the predicted one. We formally prove that it is the case when scheduling jobs only in large intervals and when provisioning sufficient slacks to absorb uncertainties. We present multiple heuristics with various efficiencies and costs that are empirically assessed through simulations based on actual traces.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114206972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient and robust allocation algorithms in clouds under memory constraints
Olivier Beaumont, J. Lorenzo, Lionel Eyraud-Dubois, Paul Renaud-Goud
Pub Date: 2013-10-19. DOI: 10.1109/HiPC.2014.7116894
We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform in which a relatively small set of large, independent services accounts for most of the platform's overall CPU usage. We show, using a recent trace from Google, that this assumption is very reasonable in practice. The objective is to provide an allocation of the services onto the machines of the platform, using replication in order to be resilient to machine failures. The services are characterized by their demand along several dimensions (CPU, memory, ...) and by their quality-of-service requirements, which are defined through an SLA in the case of a public Cloud or fixed by the administrator in the case of a private Cloud. This quality of service defines the required robustness of the service by setting an upper limit on the probability that the provider fails to allocate the required quantity of resources. This maximum probability of failure can be transparently turned into a set of (price, penalty) pairs. Our contribution is two-fold. First, we propose a formal model for this allocation problem and justify our assumptions based on an analysis of a publicly available cluster usage trace from Google. Second, we propose a resource allocation strategy whose complexity is low in the number of resources, which makes it well suited to large platforms. Finally, we analyse the proposed strategy through an extensive set of simulations, showing that it can be successfully applied in the context of the Google trace.
{"title":"Efficient and robust allocation algorithms in clouds under memory constraints","authors":"Olivier Beaumont, J. Lorenzo, Lionel Eyraud-Dubois, Paul Renaud-Goud","doi":"10.1109/HiPC.2014.7116894","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116894","url":null,"abstract":"We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform such that a relatively small set of large and independent services accounts for most of the overall CPU usage of the platform. We will show, using a recent trace from Google, that this assumption is very reasonable in practice. The objective is to provide an allocation of the services onto the machines of the platform, using replication in order to be resilient to machine failures. The services are characterized by their demand along several dimensions (CPU, memory,...) and by their quality of service requirements, that have been defined through an SLA in the case of a public Cloud or fixed by the administrator in the case of a private Cloud. This quality of service defines the required robustness of the service, by setting an upper limit on the probability that the provider fails to allocate the required quantity of resources. This maximum probability of failure can be transparently turned into a set of (price, penalty) pairs. Our contribution is two-fold. First, we propose a formal model for this allocation problem, and we justify our assumptions based on an analysis of a publicly available cluster usage trace from Google. Second, we propose a resource allocation strategy whose complexity is low in the number of resources, what makes it well suited to large platforms. Finally, we provide an analysis of the proposed strategy through an extensive set of simulations, showing that it can be succesfully applied in the context of the Google trace.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128189982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards realizing the potential of malleable jobs
Abhishek K. Gupta, Bilge Acun, O. Sarood, L. Kalé
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116905
Malleable jobs are those that can dynamically shrink or expand the number of processors on which they execute, at runtime, in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs a true shrink/expand, eliminating the need for any residual processes; it requires little application-programmer effort and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach: shrinking from 2k to 1k cores takes 16 s, while expanding from 1k to 2k cores takes 40 s. We also demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.
{"title":"Towards realizing the potential of malleable jobs","authors":"Abhishek K. Gupta, Bilge Acun, O. Sarood, L. Kalé","doi":"10.1109/HiPC.2014.7116905","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116905","url":null,"abstract":"Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits, three components are critical - an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs true shrink/expand eliminating the need of any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s while expand from 1k to 2k takes 40s. Also, we demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124387334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing the performance of parallel applications on a 5D torus via task mapping
A. Bhatele, Nikhil Jain, Katherine E. Isaacs, Ronak Buch, T. Gamblin, S. Langer, L. Kalé
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116706
Six of the ten fastest supercomputers in the world in 2014 use a torus interconnection network for message passing between compute nodes. Torus networks provide high-bandwidth links to near neighbors and low latencies over multiple hops on the network. However, the large diameters of such networks necessitate careful placement of parallel tasks on the compute nodes to minimize network congestion. This paper presents a methodological study of optimizing application performance on a five-dimensional torus network via the technique of topology-aware task mapping. Task mapping refers to the placement of processes on compute nodes while carefully considering the network topology between the nodes and the communication behavior of the application. We focus on the IBM Blue Gene/Q machine and two production applications - a laser-plasma interaction code called pF3D and a lattice QCD application called MILC. Optimizations presented in the paper improve the communication performance of pF3D by 90% and that of MILC by up to 47%.
{"title":"Optimizing the performance of parallel applications on a 5D torus via task mapping","authors":"A. Bhatele, Nikhil Jain, Katherine E. Isaacs, Ronak Buch, T. Gamblin, S. Langer, L. Kalé","doi":"10.1109/HiPC.2014.7116706","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116706","url":null,"abstract":"Six of the ten fastest supercomputers in the world in 2014 use a torus interconnection network for message passing between compute nodes. Torus networks provide high bandwidth links to near-neighbors and low latencies over multiple hops on the network. However, large diameters of such networks necessitate a careful placement of parallel tasks on the compute nodes to minimize network congestion. This paper presents a methodological study of optimizing application performance on a five-dimensional torus network via the technique of topology-aware task mapping. Task mapping refers to the placement of processes on compute nodes while carefully considering the network topology between the nodes and the communication behavior of the application. We focus on the IBM Blue Gene/Q machine and two production applications - a laser-plasma interaction code called pF3D and a lattice QCD application called MILC. Optimizations presented in the paper improve the communication performance of pF3D by 90% and that of MILC by up to 47%.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130789573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis and tuning of libtensor framework on multicore architectures
K. Ibrahim, Samuel Williams, E. Epifanovsky, A. Krylov
Pub Date: 2014-12-01. DOI: 10.1109/HIPC.2014.7116881
Libtensor is a framework designed to implement the tensor contractions arising from the coupled-cluster and equation-of-motion methods of computational quantum chemistry. It has been optimized for symmetry and sparsity in order to be memory efficient, which allows it to run efficiently on ubiquitous and cost-effective SMP architectures. Unfortunately, the movement of memory controllers onto the chip has endowed these SMP systems with strong NUMA properties. Moreover, the many-core trend in processor architecture demands that the implementation be extremely thread-scalable on a node. To date, Libtensor has been largely agnostic of these effects. To that end, in this paper we explore a number of optimization techniques, including a thread-friendly and NUMA-aware memory allocator and garbage collector, tuning of the tensor tiling factor, and tuning of the scheduling quanta. In the end, our optimizations improve the performance of contractions implemented in Libtensor by up to 2× on representative Ivy Bridge, Nehalem, and Opteron SMPs.
{"title":"Analysis and tuning of libtensor framework on multicore architectures","authors":"K. Ibrahim, Samuel Williams, E. Epifanovsky, A. Krylov","doi":"10.1109/HIPC.2014.7116881","DOIUrl":"https://doi.org/10.1109/HIPC.2014.7116881","url":null,"abstract":"Libtensor is a framework designed to implement the tensor contractions arising form the coupled cluster and equations of motion computational quantum chemistry equations. It has been optimized for symmetry and sparsity to be memory efficient. This allows it to run efficiently on the ubiquitous and cost-effective SMP architectures. Unfortunately, movement of memory controllers on chip has endowed these SMP systems with strong NUMA properties. Moreover, the many core trend in processor architecture demands that the implementation be extremely thread-scalable on node. To date, Libtensor has been generally agnostic of these effects. To that end, in this paper, we explore a number of optimization techniques including a thread-friendly and NUMA-aware memory allocator and garbage collector, tuning the tensor tiling factor, and tuning the scheduling quanta. In the end, our optimizations can improve the performance of contractions implemented in Libtensor by up to 2× on representative Ivy Bridge, Nehalem, and Opteron SMPs.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114073091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}