Premonition of storage response class using Skyline ranked Ensemble method
K. Dheenadayalan, V. Muralidhara, Pushpa Datla, G. Srinivasaraghavan, Maulik Shah
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116886
Tertiary storage areas are integral parts of a compute environment and are primarily used to store the vast amounts of data generated by scientific and industrial workloads. Modelling the likely usage patterns of a storage area helps administrators take preventive action and guide users on how to use storage areas that are trending towards a slow or unresponsive state. Treating the storage performance parameters as time-series data allows the values for the next `n' intervals to be predicted with forecasting models such as ARIMA. The predicted performance parameters are then used to classify whether the entire storage area, or a logical component of it, is tending towards unresponsiveness. Classification is performed using the proposed Skyline ranked Ensemble model with two possible classes: high response state and low response state. Heavy-load scenarios were simulated, and close to 95% of the observed behaviour was explained by the proposed model.
{"title":"Premonition of storage response class using Skyline ranked Ensemble method","authors":"K. Dheenadayalan, V. Muralidhara, Pushpa Datla, G. Srinivasaraghavan, Maulik Shah","doi":"10.1109/HiPC.2014.7116886","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116886","url":null,"abstract":"Tertiary storage areas are integral parts of compute environment and are primarily used to store vast amount of data that is generated from any scientific/industry workload. Modelling the possible pattern of usage of storage area helps the administrators to take preventive actions and guide users on how to use the storage areas which are tending towards slower to unresponsive state. Treating the storage performance parameters as a time series data helps to predict the possible values for the next `n' intervals using forecasting models like ARIMA. These predicted performance parameters are used to classify if the entire storage area or a logical component is tending towards unresponsiveness. Classification is performed using the proposed Skyline ranked Ensemble model with two possible classes, i.e. high response state and low response state. Heavy load scenarios were simulated and close to 95% of the behaviour were explained using the proposed model.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134375022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A flexible scheduling framework for heterogeneous CPU-GPU clusters
Kittisak Sajjapongse, T. Agarwal, M. Becchi
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116892
In the last few years, thanks to their computational power and progressively increased programmability, GPUs have become part of HPC clusters. As a result, widely used open-source cluster resource managers (e.g. TORQUE and SLURM) have recently been extended with GPU support. These systems, however, treat GPUs as dedicated resources and provide scheduling mechanisms that often result in resource underutilization and, thereby, suboptimal performance. We propose a cluster-level scheduler and integrate it with our previously proposed node-level GPU virtualization runtime [1, 2], providing a hierarchical cluster resource management framework that allows efficient use of heterogeneous CPU-GPU clusters. The scheduling policy used by our system is configurable, and our scheduler provides administrators with a high-level API that makes it easy to define custom scheduling policies. We provide two application- and hardware-heterogeneity-aware cluster-level scheduling schemes for hybrid MPI-CUDA applications: co-location-based and latency-reduction-based scheduling, and use them in combination with a preemption-based GPU sharing policy implemented at the node level. We validate our framework on two heterogeneous clusters, one consisting of commodity workstations and the other of high-end nodes with various hardware configurations, using a mix of communication- and compute-intensive applications. Our experiments show that, by better utilizing the available resources, our scheduling framework outperforms existing batch schedulers in terms of both throughput and application latency.
{"title":"A flexible scheduling framework for heterogeneous CPU-GPU clusters","authors":"Kittisak Sajjapongse, T. Agarwal, M. Becchi","doi":"10.1109/HiPC.2014.7116892","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116892","url":null,"abstract":"In the last few years, thanks to their computational power and progressively increased programmability, GPUs have become part of HPC clusters. As a result, widely used open-source cluster resource managers (e.g. TORQUE and SLURM) have recently been extended with GPU support capabilities. These systems, however, treat GPUs as dedicated resources and provide scheduling mechanisms that often result in resource underutilization and, thereby, in suboptimal performance. We propose a cluster-level scheduler and integrate it with our previously proposed node-level GPU virtualization runtime [1, 2], thus providing a hierarchical cluster resource management framework that allows the efficient use of heterogeneous CPU-GPU clusters. The scheduling policy used by our system is configurable, and our scheduler provides administrators with a high-level API that allows easily defining custom scheduling policies. We provide two application- and hardware-heterogeneity-aware cluster-level scheduling schemes for hybrid MPI-CUDA applications: co-location- and latency-reduction-based scheduling, and use them in combination with a preemption-based GPU sharing policy implemented at the node-level. We validate our framework on two heterogeneous clusters: one consisting of commodity workstations and the other of high-end nodes with various hardware configurations, and on a mix of communication- and compute-intensive applications. Our experiments show that, by better utilizing the available resources, our scheduling framework outperforms existing batch-schedulers both in terms of throughput and application latency.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116466397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Multi-dimensional query processing with data migration in distributed cache infrastructure
Youngmoon Eom, Jinwoong Kim, Deukyeon Hwang, J. Kwak, Minho Shin, Beomseok Nam
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116906
In distributed query processing systems where the caching infrastructure is distributed and scales with the number of servers, orchestrating and leveraging a large number of cached objects seamlessly is becoming more important, as the present trend is to build large scalable systems by connecting many small heterogeneous machines. With a large-scale distributed caching system, a scheduling policy must consider both the cache hit ratio and the system load balance to optimize multiple queries. A scheduling policy that considers system load but not cache hit ratio often fails to reuse cached data, as it may not assign a query to the server that holds the objects the query needs. Conversely, a scheduling policy that considers cache hit ratio but not system load may suffer from load imbalance. To maximize overall system throughput and reduce query response time, a multiple-query scheduling policy must balance system load while also leveraging cached objects. In this paper, we present a distributed query processing framework that achieves a high cache hit ratio while maintaining good system load balance. To manage our distributed, scalable caching system seamlessly, the framework performs autonomic cached-data migrations that further improve the cache hit ratio. Our experiments show that the proposed query scheduling and data migration policies significantly improve system throughput by achieving a high cache hit ratio while avoiding load imbalance.
{"title":"Improving Multi-dimensional query processing with data migration in distributed cache infrastructure","authors":"Youngmoon Eom, Jinwoong Kim, Deukyeon Hwang, J. Kwak, Minho Shin, Beomseok Nam","doi":"10.1109/HiPC.2014.7116906","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116906","url":null,"abstract":"In distributed query processing systems where caching infrastructure is distributed and scales with the number of servers, it is becoming more important to orchestrate and leverage a large number of cached objects in distributed caching systems seamlessly as the present trend is to build large scalable distributed systems by connecting small heterogeneous machines. With a large scale distributed caching system, a scheduling policy must consider both cache hit ratio and system load balance to optimize multiple queries. A scheduling policy that considers system load but not cache hit ratio often fails to reuse cached data by not assigning a query to the sever that has data objects the query needs. On the contrary, a scheduling policy that considers cache hit ratio but not system load balance may suffer from system load imbalance. To maximize the overall system throughput and to reduce query response time, a multiple query scheduling policy must balance system load and also leverage cached objects. In this paper, we present a distributed query processing framework that exhibits high cache hit ratio while achieving good system load balance. In order to seamlessly manage our distributed scalable caching system, our framework performs autonomic cached data migrations to improve cache hit ratio. Our experiments show that our proposed query scheduling policy and data migration policy significantly improve system throughput by achieving high cache hit ratio while avoiding system load imbalance.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127920387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coupling-aware graph partitioning algorithms: Preliminary study
Maria Predari, Aurélien Esnard
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116879
In the field of scientific computing, load balancing is a major issue that determines the performance of parallel applications. Nowadays, simulations of real-life problems are becoming more and more complex, involving numerous coupled codes that represent different models. In this context, reaching high performance can be a great challenge. In this paper, we present graph partitioning techniques, called co-partitioning, that address the problem of load balancing for two coupled codes: the key idea is to perform a “coupling-aware” partitioning instead of partitioning the codes independently, as is usually done. Finally, we present a preliminary experimental study comparing our methods against the usual approach.
{"title":"Coupling-aware graph partitioning algorithms: Preliminary study","authors":"Maria Predari, Aurélien Esnard","doi":"10.1109/HiPC.2014.7116879","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116879","url":null,"abstract":"In the field of scientific computing, load balancing is a major issue that determines the performance of parallel applications. Nowadays, simulations of real-life problems are becoming more and more complex, involving numerous coupled codes, representing different models. In this context, reaching high performance can be a great challenge. In this paper, we present graph partitioning techniques, called co-partitioning, that address the problem of load balancing for two coupled codes: the key idea is to perform a “coupling-aware” partitioning, instead of partitioning these codes independently, as it is usually done. Finally, we present a preliminary experimental study which compares our methods against the usual approach.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129407676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A proactive approach for coping with uncertain resource availabilities on desktop grids
Louis-Claude Canon, Adel Essafi, D. Trystram
Pub Date: 2014-05-29. DOI: 10.1109/HiPC.2014.7116890
Uncertainties stemming from multiple sources affect distributed systems and jeopardize their efficient utilization. Desktop grids are especially affected by this issue, as the volunteers lending their resources may behave irregularly and unpredictably. Efficiently exploiting the power of such systems raises theoretical issues that have received little attention in the literature. In this paper, we assume that predictions exist for the intervals during which machines are available. When the error of these predictions is bounded, it is possible to schedule a set of jobs such that the effective total execution time will not exceed the predicted one. We formally prove that this holds when jobs are scheduled only in large intervals and sufficient slack is provisioned to absorb the uncertainties. We also present multiple heuristics with various efficiencies and costs, which are empirically assessed through simulations based on actual traces.
{"title":"A proactive approach for coping with uncertain resource availabilities on desktop grids","authors":"Louis-Claude Canon, Adel Essafi, D. Trystram","doi":"10.1109/HiPC.2014.7116890","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116890","url":null,"abstract":"Uncertainties stemming from multiple sources affect distributed systems and jeopardize their efficient utilization. Desktop grids are especially concerned by this issue as volunteers lending their resources may have irregular and unpredictable behaviors. Efficiently exploiting the power of such systems raises theoretical issues that received little attention in the literature. In this paper, we assume that there exist predictions on the intervals during which machines are available. When these predictions have a limited estimation, it is possible to schedule a set of jobs such that the effective total execution time will not be higher than the predicted one. We formally prove that it is the case when scheduling jobs only in large intervals and when provisioning sufficient slacks to absorb uncertainties. We present multiple heuristics with various efficiencies and costs that are empirically assessed through simulations based on actual traces.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114206972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient and robust allocation algorithms in clouds under memory constraints
Olivier Beaumont, J. Lorenzo, Lionel Eyraud-Dubois, Paul Renaud-Goud
Pub Date: 2013-10-19. DOI: 10.1109/HiPC.2014.7116894
We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform in which a relatively small set of large, independent services accounts for most of the platform's overall CPU usage. We show, using a recent trace from Google, that this assumption is very reasonable in practice. The objective is to provide an allocation of the services onto the machines of the platform, using replication in order to be resilient to machine failures. The services are characterized by their demand along several dimensions (CPU, memory, ...) and by their quality-of-service requirements, which are defined through an SLA in the case of a public Cloud or fixed by the administrator in the case of a private Cloud. This quality of service defines the required robustness of the service by setting an upper limit on the probability that the provider fails to allocate the required quantity of resources. This maximum probability of failure can be transparently turned into a set of (price, penalty) pairs. Our contribution is two-fold. First, we propose a formal model for this allocation problem and justify our assumptions based on an analysis of a publicly available cluster usage trace from Google. Second, we propose a resource allocation strategy whose complexity is low in the number of resources, which makes it well suited to large platforms. Finally, we analyse the proposed strategy through an extensive set of simulations, showing that it can be successfully applied in the context of the Google trace.
{"title":"Efficient and robust allocation algorithms in clouds under memory constraints","authors":"Olivier Beaumont, J. Lorenzo, Lionel Eyraud-Dubois, Paul Renaud-Goud","doi":"10.1109/HiPC.2014.7116894","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116894","url":null,"abstract":"We consider robust resource allocation of services in Clouds. More specifically, we consider the case of a large public or private Cloud platform such that a relatively small set of large and independent services accounts for most of the overall CPU usage of the platform. We will show, using a recent trace from Google, that this assumption is very reasonable in practice. The objective is to provide an allocation of the services onto the machines of the platform, using replication in order to be resilient to machine failures. The services are characterized by their demand along several dimensions (CPU, memory,...) and by their quality of service requirements, that have been defined through an SLA in the case of a public Cloud or fixed by the administrator in the case of a private Cloud. This quality of service defines the required robustness of the service, by setting an upper limit on the probability that the provider fails to allocate the required quantity of resources. This maximum probability of failure can be transparently turned into a set of (price, penalty) pairs. Our contribution is two-fold. First, we propose a formal model for this allocation problem, and we justify our assumptions based on an analysis of a publicly available cluster usage trace from Google. Second, we propose a resource allocation strategy whose complexity is low in the number of resources, what makes it well suited to large platforms. Finally, we provide an analysis of the proposed strategy through an extensive set of simulations, showing that it can be succesfully applied in the context of the Google trace.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128189982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards realizing the potential of malleable jobs
Abhishek K. Gupta, Bilge Acun, O. Sarood, L. Kalé
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116905
Malleable jobs are those that can dynamically shrink or expand the number of processors on which they execute, at runtime, in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs a true shrink/expand, eliminating the need for any residual processes; it requires little application-programmer effort and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach: shrinking from 2k to 1k cores takes 16 s, while expanding from 1k to 2k cores takes 40 s. We also demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.
{"title":"Towards realizing the potential of malleable jobs","authors":"Abhishek K. Gupta, Bilge Acun, O. Sarood, L. Kalé","doi":"10.1109/HiPC.2014.7116905","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116905","url":null,"abstract":"Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits, three components are critical - an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs true shrink/expand eliminating the need of any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s while expand from 1k to 2k takes 40s. Also, we demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124387334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing the performance of parallel applications on a 5D torus via task mapping
A. Bhatele, Nikhil Jain, Katherine E. Isaacs, Ronak Buch, T. Gamblin, S. Langer, L. Kalé
Pub Date: 2014-12-01. DOI: 10.1109/HiPC.2014.7116706
Six of the ten fastest supercomputers in the world in 2014 use a torus interconnection network for message passing between compute nodes. Torus networks provide high-bandwidth links to near neighbors and low latencies over multiple hops on the network. However, the large diameters of such networks necessitate careful placement of parallel tasks on the compute nodes to minimize network congestion. This paper presents a methodological study of optimizing application performance on a five-dimensional torus network via the technique of topology-aware task mapping. Task mapping refers to the placement of processes on compute nodes while carefully considering the network topology between the nodes and the communication behavior of the application. We focus on the IBM Blue Gene/Q machine and two production applications - a laser-plasma interaction code called pF3D and a lattice QCD application called MILC. Optimizations presented in the paper improve the communication performance of pF3D by 90% and that of MILC by up to 47%.
{"title":"Optimizing the performance of parallel applications on a 5D torus via task mapping","authors":"A. Bhatele, Nikhil Jain, Katherine E. Isaacs, Ronak Buch, T. Gamblin, S. Langer, L. Kalé","doi":"10.1109/HiPC.2014.7116706","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116706","url":null,"abstract":"Six of the ten fastest supercomputers in the world in 2014 use a torus interconnection network for message passing between compute nodes. Torus networks provide high bandwidth links to near-neighbors and low latencies over multiple hops on the network. However, large diameters of such networks necessitate a careful placement of parallel tasks on the compute nodes to minimize network congestion. This paper presents a methodological study of optimizing application performance on a five-dimensional torus network via the technique of topology-aware task mapping. Task mapping refers to the placement of processes on compute nodes while carefully considering the network topology between the nodes and the communication behavior of the application. We focus on the IBM Blue Gene/Q machine and two production applications - a laser-plasma interaction code called pF3D and a lattice QCD application called MILC. Optimizations presented in the paper improve the communication performance of pF3D by 90% and that of MILC by up to 47%.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130789573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analysis and tuning of libtensor framework on multicore architectures
K. Ibrahim, Samuel Williams, E. Epifanovsky, A. Krylov
Pub Date: 2014-12-01. DOI: 10.1109/HIPC.2014.7116881
Libtensor is a framework designed to implement the tensor contractions arising from the coupled-cluster and equation-of-motion methods of computational quantum chemistry. It has been optimized for symmetry and sparsity in order to be memory efficient, which allows it to run efficiently on ubiquitous and cost-effective SMP architectures. Unfortunately, the movement of memory controllers onto the chip has endowed these SMP systems with strong NUMA properties. Moreover, the many-core trend in processor architecture demands that the implementation be extremely thread-scalable on a node. To date, Libtensor has been largely agnostic of these effects. To that end, in this paper we explore a number of optimization techniques, including a thread-friendly and NUMA-aware memory allocator and garbage collector, tuning of the tensor tiling factor, and tuning of the scheduling quanta. In the end, our optimizations improve the performance of contractions implemented in Libtensor by up to 2× on representative Ivy Bridge, Nehalem, and Opteron SMPs.
{"title":"Analysis and tuning of libtensor framework on multicore architectures","authors":"K. Ibrahim, Samuel Williams, E. Epifanovsky, A. Krylov","doi":"10.1109/HIPC.2014.7116881","DOIUrl":"https://doi.org/10.1109/HIPC.2014.7116881","url":null,"abstract":"Libtensor is a framework designed to implement the tensor contractions arising form the coupled cluster and equations of motion computational quantum chemistry equations. It has been optimized for symmetry and sparsity to be memory efficient. This allows it to run efficiently on the ubiquitous and cost-effective SMP architectures. Unfortunately, movement of memory controllers on chip has endowed these SMP systems with strong NUMA properties. Moreover, the many core trend in processor architecture demands that the implementation be extremely thread-scalable on node. To date, Libtensor has been generally agnostic of these effects. To that end, in this paper, we explore a number of optimization techniques including a thread-friendly and NUMA-aware memory allocator and garbage collector, tuning the tensor tiling factor, and tuning the scheduling quanta. In the end, our optimizations can improve the performance of contractions implemented in Libtensor by up to 2× on representative Ivy Bridge, Nehalem, and Opteron SMPs.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114073091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}