Unikernels are a promising alternative for application deployment in cloud platforms. They have a very small footprint, providing better deployment agility and portability across virtualization platforms. Like Linux containers, they are a lightweight option for deploying distributed applications based on microservices. However, a comparison of unikernels with other virtualization options regarding the concurrent provisioning of instances, as required by microservices-based applications, is still lacking. This paper evaluates KVM (virtual machines), Docker (containers), and OSv (unikernel) when provisioning multiple instances concurrently in an OpenStack cloud platform. We confirm that OSv outperforms the other options and also identify opportunities for optimization.
{"title":"Time Provisioning Evaluation of KVM, Docker and Unikernels in a Cloud Platform","authors":"Bruno Xavier, T. Ferreto, L. C. Jersak","doi":"10.1109/CCGrid.2016.86","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.86","url":null,"abstract":"Unikernels are a promising alternative for application deployment in cloud platforms. They comprise a very small footprint, providing better deployment agility and portability among virtualization platforms. Similar to Linux containers, they are a lightweight alternative for deploying distributed applications based on microservices. However, the comparison of unikernels with other virtualization options regarding the concurrent provisioning of instances, as in the case of microservices-based applications, is still lacking. This paper provides an evaluation of KVM (Virtual Machines), Docker (Containers), and OSv (Unikernel), when provisioning multiple instances concurrently in an OpenStack cloud platform. We confirmed that OSv outperforms the other options and also identified opportunities for optimization.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"15 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120921743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiangyu Guo, Binbin Tang, Jian Tao, Zhaohui Huang, Zhihui Du
PPMLR-MHD is a new magnetohydrodynamics (MHD) model used to simulate the interaction of the solar wind with the magnetosphere, a key element of the space-weather cause-and-effect chain from the Sun to the Earth. Compared to existing MHD methods, PPMLR-MHD offers high-order spatial accuracy and low numerical dissipation. However, this accuracy comes at a cost: the method requires more intensive computation, and more boundary data must be transferred during the simulation. In this work, we present a hybrid parallel solution of the PPMLR-MHD model that exploits the computing capabilities of both CPUs and GPUs. We demonstrate that our optimized implementation alleviates the data-transfer overhead by using GPUDirect technology, scales up to 151 processes, and achieves significant performance gains by distributing the workload among the CPUs and GPUs of Titan at Oak Ridge National Laboratory. The performance results show that our implementation is fast enough to carry out highly accurate MHD simulations in real time.
{"title":"Large Scale GPU Accelerated PPMLR-MHD Simulations for Space Weather Forecast","authors":"Xiangyu Guo, Binbin Tang, Jian Tao, Zhaohui Huang, Zhihui Du","doi":"10.1109/CCGrid.2016.68","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.68","url":null,"abstract":"PPMLR-MHD is a new magnetohydrodynamics (MHD) model used to simulate the interactions of the solar wind with the magnetosphere, which has been proved to be the key element of the space weather cause-and-effect chain process from the Sun to Earth. Compared to existing MHD methods, PPMLR-MHD achieves the advantage of high order spatial accuracy and low numerical dissipation. However, the accuracy comes at a cost. On one hand, this method requires more intensive computation. On the other hand, more boundary data is subject to be transferred during the process of simulation. In this work, we present a parallel hybrid solution of the PPMLR-MHD model implemented using the computing capabilities of both CPUs and GPUs. We demonstrate that our optimized implementation alleviates the data transfer overhead by using GPU Direct technology and can scale up to 151 processes and achieve significant performance gains by distributing the workload among the CPUs and GPUs on Titan at Oak Ridge National Laboratory. The performance results show that our implementation is fast enough to carry out highly accurate MHD simulations in real time.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116704950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Kwak, Eunji Hwang, Tae-kyung Yoo, Beomseok Nam, Young-ri Choi
In this paper, we investigate techniques to effectively orchestrate HDFS in-memory caching for Hadoop. We first evaluate the degree of benefit that various MapReduce applications obtain from in-memory caching, i.e., their cache affinity. We then propose an adaptive cache-local scheduling algorithm that adjusts the time a MapReduce job waits in the queue for a cache-local node, setting the waiting time proportional to the percentage of the job's input data that is cached. We also develop a cache-affinity-based replacement algorithm that decides which blocks are cached and evicted based on the cache affinity of the applications. Using workloads consisting of multiple MapReduce applications, we conduct an experimental study to demonstrate the effects of the proposed in-memory orchestration techniques. Our results show that our enhanced Hadoop in-memory caching scheme improves the performance of the MapReduce workloads by up to 18% and 10% over Hadoop with HDFS in-memory caching disabled and enabled, respectively.
{"title":"In-Memory Caching Orchestration for Hadoop","authors":"J. Kwak, Eunji Hwang, Tae-kyung Yoo, Beomseok Nam, Young-ri Choi","doi":"10.1109/CCGrid.2016.73","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.73","url":null,"abstract":"In this paper, we investigate techniques to effectively orchestrate HDFS in-memory caching for Hadoop. We first evaluate a degree of benefit which each of various MapReduce applications can get from in-memory caching, i.e. cache affinity. We then propose an adaptive cache local scheduling algorithm that adaptively adjusts the waiting time of a MapReduce job in a queue for a cache local node. We set the waiting time to be proportional to the percentage of cached input data for the job. We also develop a cache affinity cache replacement algorithm that determines which block is cached and evicted based on the cache affinity of applications. Using various workloads consisting of multiple MapReduce applications, we conduct experimental study to demonstrate the effects of the proposed in-memory orchestration techniques. Our experimental results show that our enhanced Hadoop in-memory caching scheme improves the performance of the MapReduce workloads up to 18% and 10% against Hadoop that disables and enables HDFS in-memory caching, respectively.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116725936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the most important factors influencing the performance of parallel applications is the speed of communication between their tasks. To optimize communication, tasks that exchange large amounts of data should be mapped to processing units with high network performance between them. This technique, called communication-aware task mapping, requires detailed information about the underlying network topology for an accurate mapping. Previous work on task mapping focuses on network clusters or shared-memory architectures, where the topology can be determined directly from the hardware environment. Cloud computing adds significant challenges to task mapping, since information about network topologies is not available to end users. Furthermore, communication performance can change due to external factors, such as the usage patterns of other users. In this paper, we present a novel solution for communication-aware task mapping in commercial cloud environments with multiple instances. Our proposal consists of a short profiling phase that discovers the network topology and speed between cloud instances; the profiling can be executed before each application launch, as it causes only negligible overhead. This information is then combined with the communication pattern of the parallel application to group tasks by the amount of communication between them and to map heavily communicating groups to cloud instances with high network performance. In this way, application performance is increased and data traffic between instances is reduced. We evaluated our proposal in a public cloud with a variety of MPI-based parallel benchmarks from the HPC domain, as well as a large scientific application. In the experiments, we observed substantial performance improvements (up to 11 times faster) compared to the default scheduling policies.
{"title":"Automatic Communication Optimization of Parallel Applications in Public Clouds","authors":"E. Carreño, M. Diener, E. Cruz, P. Navaux","doi":"10.1109/CCGrid.2016.59","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.59","url":null,"abstract":"One of the most important aspects that influences the performance of parallel applications is the speed of communication between their tasks. To optimize communication, tasks that exchange lots of data should be mapped to processing units that have a high network performance. This technique is called communication-aware task mapping and requires detailed information about the underlying network topology for an accurate mapping. Previous work on task mapping focuses on network clusters or shared memory architectures, in which the topology can be determined directly from the hardware environment. Cloud computing adds significant challenges to task mapping, since information about network topologies is not available to end users. Furthermore, the communication performance might change due to external factors, such as different usage patterns of other users. In this paper, we present a novel solution to perform communication-aware task mapping in the context of commercial cloud environments with multiple instances. Our proposal consists of a short profiling phase to discover the network topology and speed between cloud instances. The profiling can be executed before each application start as it causes only a negligible overhead. This information is then used together with the communication pattern of the parallel application to group tasks based on the amount of communication and to map groups with a lot of communication between them to cloud instances with a high network performance. In this way, application performance is increased, and data traffic between instances is reduced. We evaluated our proposal in a public cloud with a variety of MPI-based parallel benchmarks from the HPC domain, as well as a large scientific application. In the experiments, we observed substantial performance improvements (up to 11 times faster) compared to the default scheduling policies.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"301 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116254966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
V. Hendrix, James Fox, D. Ghoshal, L. Ramakrishnan
The growth in scientific data volumes has created the need for new tools that enable users to operate on and analyze data on large-scale resources. In the last decade, a number of scientific workflow tools have emerged; these tools often target distributed environments and often require expert help to compose and execute the workflows. Data-intensive workflows are often ad hoc and involve an iterative development process in which users compose and test their workflows on desktops before scaling up to larger systems. In this paper, we present the design and implementation of Tigres, a workflow library that supports this iterative development cycle for data-intensive workflows. Tigres provides an application programming interface to a set of programming templates (i.e., sequence, parallel, split, and merge) that can be used to compose and execute computational and data pipelines. We discuss the results of our evaluation of scientific and synthetic workflows, showing that Tigres performs with minimal template overhead (a mean of 13 seconds over all experiments). We also discuss various factors (e.g., I/O performance and execution mechanisms) that affect the performance of scientific workflows on HPC systems.
{"title":"Tigres Workflow Library: Supporting Scientific Pipelines on HPC Systems","authors":"V. Hendrix, James Fox, D. Ghoshal, L. Ramakrishnan","doi":"10.1109/CCGrid.2016.54","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.54","url":null,"abstract":"The growth in scientific data volumes has resulted in the need for new tools that enable users to operate on and analyze data on large-scale resources. In the last decade, a number of scientific workflow tools have emerged. These tools often target distributed environments, and often need expert help to compose and execute the workflows. Data-intensive workflows are often ad-hoc, they involve an iterative development process that includes users composing and testing their workflows on desktops, and scaling up to larger systems. In this paper, we present the design and implementation of Tigres, a workflow library that supports the iterative workflow development cycle of data-intensive workflows. Tigres provides an application programming interface to a set of programming templates i.e., sequence, parallel, split, merge, that can be used to compose and execute computational and data pipelines. We discuss the results of our evaluation of scientific and synthetic workflows showing Tigres performs with minimal template overheads (mean of 13 seconds over all experiments). We also discuss various factors (e.g., I/O performance, execution mechansims) that affect the performance of scientific workflows on HPC systems.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126785124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Krish, Bharti Wadhwa, M. S. Iqbal, M. Mustafa Rafique, Ali R. Butt
A promising trend in storage management for big data frameworks, such as Hadoop and Spark, is the emergence of heterogeneous and hybrid storage systems that employ different types of storage devices (e.g., SSDs and RAMDisks) alongside traditional HDDs. However, scheduling data accesses to an appropriate storage device is non-trivial and depends on several factors, such as data locality, device performance, and the application's compute and storage resource utilization. To this end, we present DUX, an application-attuned dynamic data management system for data processing frameworks that aims to improve overall application I/O throughput by using SSDs only for workloads that are expected to benefit from them, rather than the extant approach of storing a fraction of all workloads on SSDs. The novelty of DUX lies in profiling application performance on SSDs and HDDs, analyzing the resulting I/O behavior, and considering the SSDs available at runtime to dynamically place data in an appropriate storage tier. Evaluation of DUX with trace-driven simulations using synthetic Facebook workloads shows that even when using 5.5× fewer SSDs than an SSD-only solution, DUX incurs only a small (5%) performance overhead, and thus offers affordable and efficient storage tier management.
{"title":"On Efficient Hierarchical Storage for Big Data Processing","authors":"K. Krish, Bharti Wadhwa, M. S. Iqbal, M. Mustafa Rafique, Ali R. Butt","doi":"10.1109/CCGrid.2016.61","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.61","url":null,"abstract":"A promising trend in storage management for big data frameworks, such as Hadoop and Spark, is the emergence of heterogeneous and hybrid storage systems that employ different types of storage devices, e.g. SSDs, RAMDisks, etc., alongside traditional HDDs. However, scheduling data accesses or requests to an appropriate storage device is non-trivial and depends on several factors such as data locality, device performance, and application compute and storage resources utilization. To this end, we present DUX, an application-attuned dynamic data management system for data processing frameworks, which aims to improve overall application I/O throughput by efficiently using SSDs only for workloads that are expected to benefit from them rather than the extant approach of storing a fraction of the overall workloads in SSDs. The novelty of DUX lies in profiling application performance on SSDs and HDDs, analyzing the resulting I/O behavior, and considering the available SSDs at runtime to dynamically place data in an appropriate storage tier. Evaluation of DUX with trace-driven simulations using synthetic Facebook workloads shows that even when using 5.5× fewer SSDs compared to a SSD-only solution, DUX incurs only a small (5%) performance overhead, and thus offers an affordable and efficient storage tier management.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126474733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As we face ever-increasing air traffic demand, it is critical to enhance air traffic capacity and alleviate human controllers' workload by viewing air traffic optimization as a continuous/online streaming problem. Air traffic optimization is commonly formulated as an integer linear programming (ILP) problem. Since ILP is NP-hard, it is computationally intractable. Moreover, a fluctuating number of flights changes the computational demand dynamically. In this paper, we present an elastic middleware framework that is specifically designed to solve ILP problems generated from continuous air traffic streams. Experiments show that our VM scheduling algorithm with time-series prediction achieves performance similar to a static schedule while using 49% fewer VM hours for a realistic air traffic pattern.
{"title":"Elastic Virtual Machine Scheduling for Continuous Air Traffic Optimization","authors":"Shigeru Imai, S. Patterson, Carlos A. Varela","doi":"10.1109/CCGrid.2016.87","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.87","url":null,"abstract":"As we are facing ever increasing air traffic demand, it is critical to enhance air traffic capacity and alleviate humancontrollers' workload by viewing air traffic optimization as acontinuous/online streaming problem. Air traffic optimizationis commonly formulated as an integer linear programming(ILP) problem. Since ILP is NP-hard, it is computationallyintractable. Moreover, a fluctuating number of flights changescomputational demand dynamically. In this paper, we presentan elastic middleware framework that is specifically designedto solve ILP problems generated from continuous air trafficstreams. Experiments show that our VM scheduling algorithmwith time-series prediction can achieve similar performanceto a static schedule while using 49% fewer VM hours for arealistic air traffic pattern.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126549870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongju Chae, Joonsung Kim, Youngsok Kim, Jangwoo Kim, Kyung-Ah Chang, Sang-Bum Suh, Hyogun Lee
Application caching is a key feature for enabling fast application switches on mobile devices: the entire memory pages of applications are cached in the device's physical memory. However, application caching requires a prohibitive amount of memory unless a swap feature is employed to keep only the working sets of the applications in memory. Unfortunately, mobile devices often disable this invaluable swap feature, as it can severely decrease the already marginal lifespan of the flash-based local storage device due to the increased writes. As a result, modern mobile devices suffering from insufficient memory end up killing memory-hungry applications and keeping only a few applications in memory. In this paper, we propose CloudSwap, a fast and robust swap mechanism for mobile devices that enables memory-oblivious application caching. The key idea of CloudSwap is to use the fast local storage as a cache for read-intensive swap pages, while storing prefetch-enabled, write-intensive swap pages in cloud storage. To preserve the lifespan of the local storage, CloudSwap minimizes the number of writes to it by storing the modified portions of locally swapped pages in the cloud. To reduce remote swap-in latency, CloudSwap exploits two cloud-assisted prefetch schemes: an app-aware read-ahead scheme and an access-pattern-aware prefetch scheme. Our evaluation shows that the performance of CloudSwap is comparable to that of a fast but lifespan-critical local swap system, with only an 18% lifespan reduction compared to the local swap system's 85%.
{"title":"CloudSwap: A Cloud-Assisted Swap Mechanism for Mobile Devices","authors":"Dongju Chae, Joonsung Kim, Youngsok Kim, Jangwoo Kim, Kyung-Ah Chang, Sang-Bum Suh, Hyogun Lee","doi":"10.1109/CCGrid.2016.22","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.22","url":null,"abstract":"Application caching is a key feature to enable fast application switches for mobile devices by caching the entire memory pages of applications in the device's physical memory. However, application caching requires a prohibitive amount of memory unless a swap feature is employed to maintain only the working sets of the applications in memory. Unfortunately, mobile devices often disable the invaluable swap feature as it can severely decrease the flash-based local storage device's already marginal lifespan due to the increased writes to the device. As a result, modern mobile devices suffering from the insufficient memory space end up killing memory-hungry applications and keeping only a few applications in the memory. In this paper, we propose CloudSwap, a fast and robust swap mechanism for mobile devices to enable the memory-oblivious application caching. The key idea of CloudSwap is to use the fast local storage as a cache of read-intensive swap pages, while storing prefetch-enabled, write-intensive swap pages in a cloud storage. To preserve the lifespan of the local storage, CloudSwap minimizes the number of writes to the local storage by storing the modified portions of the locally swapped pages in a cloud. To reduce the remote swap-in latency, CloudSwap exploits two cloud-assisted prefetch schemes, the app-aware read-ahead scheme and the access pattern-aware prefetch scheme. Our evaluation shows that the performance of CloudSwap is comparable to a fast, but lifespan-critical local swap system, with only 18% lifespan reduction, compared to the local swap system's 85% lifespan reduction.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"310 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132033790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On current large-scale HPC platforms, the data path from compute nodes to final storage passes through several networks interconnecting a distributed hierarchy of nodes that serve as compute nodes, I/O nodes, and file system servers. Although applications compete for resources at various system levels, current system software offers no mechanisms for globally coordinating the data flow to attain optimal resource usage or to react to overload and interference. In this paper, we describe CLARISSE, a middleware designed to enhance data-staging coordination and control in the HPC storage I/O software stack. CLARISSE exposes the parallel data flows to a higher-level hierarchy of controllers, thereby opening up the possibility of novel cross-layer optimizations based on run-time information. To the best of our knowledge, CLARISSE is the first middleware that decouples the policy, control, and data layers of the software I/O stack in order to simplify the task of globally coordinating data staging on large-scale HPC platforms. To demonstrate how CLARISSE can be used for performance enhancement, we present two case studies: an elastic load-aware collective I/O and a cross-application parallel I/O scheduling policy. The evaluation illustrates how coordination can bring a significant performance benefit with low overhead by adapting to load conditions and interference.
{"title":"CLARISSE: A Middleware for Data-Staging Coordination and Control on Large-Scale HPC Platforms","authors":"Florin Isaila, J. Carretero, R. Ross","doi":"10.1109/CCGrid.2016.24","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.24","url":null,"abstract":"On current large-scale HPC platforms the data path from compute nodes to final storage passes through several networks interconnecting a distributed hierarchy of nodes serving as compute nodes, I/O nodes, and file system servers. Although applications compete for resources at various system levels, the current system software offers no mechanisms for globally coordinating the data flow for attaining optimal resource usage and for reacting to overload or interference. In this paper we describe CLARISSE, a middleware designed to enhance data-staging coordination and control in the HPC software storage I/O stack. CLARISSE exposes the parallel data flows to a higher-level hierarchy of controllers, thereby opening up the possibility of developing novel cross-layer optimizations, based on the run-time information. To the best of our knowledge, CLARISSE is the first middleware that decouples the policy, control, and data layers of the software I/O stack in order to simplify the task of globally coordinating the data staging on large-scale HPC platforms. To demonstrate how CLARISSE can be used for performance enhancement, we present two case studies: an elastic load-aware collective I/O and a cross-application parallel I/O scheduling policy. The evaluation illustrates how coordination can bring a significant performance benefit with low overheads by adapting to load conditions and interference.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134641837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, A. Awan, D. Panda
Accelerators such as NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations such as MPI_Reduce perform computation on data in addition to the communication performed by all collectives. Historically, these collectives, due to their compute requirements, have been implemented on the CPU (host) only. However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU (device) based versions. In this paper, we tackle these challenges and provide designs and implementations for the most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to state-of-the-art algorithms that fully exploit GPU capabilities such as GPUDirect RDMA (GDR) and CUDA compute kernels to perform these operations efficiently. With our new designs, we report reduced execution time for all compute-based collectives on up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than a 40% reduction in time for large messages. Furthermore, we develop and evaluate analytical models to understand and predict the performance of the proposed designs for extremely large-scale GPU clusters.
{"title":"CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters","authors":"Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, A. Awan, D. Panda","doi":"10.1109/CCGrid.2016.111","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.111","url":null,"abstract":"Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. Massive heterogeneous parallelism offered by these accelerators have led to GPU-Aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, these collectives, due to their compute requirements have been implemented on CPU (or Host) only. However, with the advent of GPU technologies it has become important for MPI libraries to provide better design for their GPU (or Device) based versions. In this paper, we tackle the above challenges and provide designs and implementations for most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - for GPU clusters. We propose extensions to the state-of-the-art algorithms to fully take advantage of the GPU capabilities like GPUDirect RDMA (GDR) and CUDA compute kernel to efficiently perform these operations. With our new designs, we report reduced execution time for all compute-based collectives up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of proposed designs for extremely large-scale GPU clusters.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"75 15","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134196959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}