
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID): Latest Publications

HPDV: A Highly Parallel Deduplication Cluster for Virtual Machine Images
Chuan Lin, Q. Cao, Jianzhong Huang, Jie Yao, Xiaoqian Li, C. Xie
Data deduplication has been widely adopted to reduce the storage requirements of virtual machine (VM) images running on VM servers in virtualized cloud platforms. Nevertheless, existing state-of-the-art deduplication approaches for VM images cannot fully exploit the potential of the underlying hardware while also limiting the interference of deduplication with foreground VM services, which can degrade the quality of those services. In this paper, we present HPDV, a highly parallel deduplication cluster for VM images that exploits parallelism to achieve high throughput with minimal interference with foreground VM services. The main idea behind HPDV is to exploit the idle CPU resources of VM servers to parallelize the compute-intensive chunking and fingerprinting, and to parallelize the I/O-intensive fingerprint indexing in the deduplication servers by dividing the globally shared fingerprint index into multiple independent sub-indexes according to the operating systems of the VM images. To ensure the quality of VM services, a resource-aware scheduler is proposed that dynamically adjusts the number of parallel chunking and fingerprinting threads according to the CPU utilization of the VM servers. Our evaluation results demonstrate that, compared to Light, a state-of-the-art deduplication system for VM images, HPDV achieves up to a 67% improvement in deduplication throughput.
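The partitioned-index idea is simple to illustrate. Below is a minimal, hypothetical sketch rather than the authors' implementation: it assumes fixed-size 4 MiB chunks, SHA-1 fingerprints, an in-memory dict per operating-system family as the sub-index, and a thread pool for the compute-intensive fingerprinting step.

```python
# Hypothetical sketch of HPDV's two core ideas, not the authors' code:
# (1) parallel chunking and fingerprinting of a VM image, and
# (2) a fingerprint index partitioned into independent per-OS sub-indexes.
# Chunk size, hash choice, and the in-memory dict index are assumptions.
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB fixed-size chunks


def chunk_image(path):
    """Split an image file into fixed-size chunks."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK_SIZE)
            if not buf:
                break
            yield buf


def fingerprint(buf):
    return hashlib.sha1(buf).hexdigest()


class PartitionedIndex:
    """Globally shared index split into independent per-OS sub-indexes,
    so lookups for different OS families can proceed in parallel."""

    def __init__(self):
        self.sub_indexes = {}  # os_family -> {fingerprint: chunk location}

    def dedup(self, os_family, fp, store_chunk):
        sub = self.sub_indexes.setdefault(os_family, {})
        if fp in sub:
            return sub[fp], True       # duplicate: reuse the stored chunk
        sub[fp] = store_chunk()        # unique: store it and record location
        return sub[fp], False


def dedup_image(index, path, os_family, store, workers=4):
    chunks = list(chunk_image(path))
    # Parallelize the compute-intensive fingerprinting step; the paper's
    # resource-aware scheduler would size `workers` from CPU utilization.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fps = list(pool.map(fingerprint, chunks))
    return [index.dedup(os_family, fp, lambda c=c: store(c))
            for fp, c in zip(fps, chunks)]
```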
Citations: 13
Decentralized Admission Control for High-Throughput Key-Value Data Stores
Young Ki Kim, M. HoseinyFarahabady, Young Choon Lee, Albert Y. Zomaya
Workload surges are a serious hindrance to the performance of even high-throughput key-value data stores such as Cassandra, MongoDB, and, more recently, Aerospike. In this paper, we present a decentralized admission controller for high-throughput key-value data stores. The proposed controller dynamically regulates the release time of incoming requests, explicitly taking into account different Quality of Service (QoS) classes. In particular, an instance of the controller is assigned to each client for autonomous admission control specific to that client's QoS requirements. These controllers operate in a decentralized manner using only local performance metrics: response time and queue waiting time. Despite the use of such "minimal" run-time state information, our decentralized admission controller is capable of coping with workload surges while respecting QoS requirements. The performance evaluation compares the proposed admission controller with the default scheduling policy of Aerospike in a testbed cluster under various workload intensity rates. Experimental results confirm that, under high-rate workloads, the proposed controller improves QoS satisfaction in terms of end-to-end response time by nearly 12 times on average compared with Aerospike's. Results also show decreases in the average and standard deviation of latency of up to 31% and 50%, respectively, during workload surges (peak load) in high-rate workloads.
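The control loop can be sketched in a few lines. This is a hypothetical, simplified controller assuming a proportional update of the release delay from the locally observed response time; the paper's actual control law and QoS-class handling may differ.

```python
# Hypothetical sketch of a per-client admission controller: it lengthens
# the release delay of queued requests when the locally observed response
# time violates the client's QoS target, and relaxes it otherwise. The
# proportional update rule is an assumption, not the paper's control law.
import collections
import time


class AdmissionController:
    def __init__(self, qos_target_ms, gain=0.5):
        self.target = qos_target_ms    # QoS class: end-to-end target
        self.gain = gain
        self.delay_ms = 0.0            # current release delay
        self.queue = collections.deque()

    def observe(self, response_time_ms):
        # Decentralized: uses only this client's local measurements.
        error = response_time_ms - self.target
        self.delay_ms = max(0.0, self.delay_ms + self.gain * error)

    def submit(self, request):
        self.queue.append(request)

    def release_next(self, send):
        if self.queue:
            time.sleep(self.delay_ms / 1000.0)  # regulate release time
            send(self.queue.popleft())
```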
Citations: 1
An Overview of Cloud Simulation Enhancement Using the Monte-Carlo Method
Luke Bertot, S. Genaud, J. Gossa
In the cloud computing model, cloud providers invoice clients for resource consumption. Hence, tools that help clients budget the cost of running their applications are of pre-eminent importance. However, the opaque and multi-tenant nature of clouds makes job runtimes both variable and hard to predict. In this paper, we propose an improved simulation framework that takes this variability into account using the Monte-Carlo method. We consider the execution of batch jobs on an actual platform, scheduled using typical heuristics based on user estimates of tasks' runtimes. We model the observed variability through simple distributions and use them as inputs to the Monte-Carlo simulation. We show that our method can capture over 90% of the empirical observations of total execution times.
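The Monte-Carlo treatment of runtime variability can be sketched compactly. The normal runtime model, the two-machine list schedule, and the percentile interval below are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of the Monte-Carlo idea: fit each task's runtime to a
# simple distribution, simulate the schedule many times, and report an
# empirical interval for the total execution time.
import heapq
import random


def simulate_makespan(task_means, task_stdevs, machines=2):
    free = [0.0] * machines                     # next-free time per machine
    heapq.heapify(free)
    for mu, sigma in zip(task_means, task_stdevs):
        runtime = max(0.0, random.gauss(mu, sigma))  # sampled variability
        start = heapq.heappop(free)             # earliest-available machine
        heapq.heappush(free, start + runtime)
    return max(free)


means = [60, 45, 90, 30, 75]                    # fitted per-task means (s)
stdevs = [6, 5, 9, 3, 8]                        # fitted per-task deviations
samples = sorted(simulate_makespan(means, stdevs) for _ in range(10_000))
# Empirical 5th-95th percentile interval for the total execution time:
print(samples[500], samples[9500])
```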
Citations: 6
High-Cold Environment Joint Observation and Research Cloud of China
Yufang Min, Yaonan Zhang, J. Huo, Keting Feng, Jianfang Kang, Guohui Zhao
In recent years, the Data-Model-Simulation research paradigm has become one of the main methods supporting surface-process research in high-cold environments (alpine cold areas and high-latitude cold areas). This research mode requires the support of an e-Geoscience environment built on data, models, high-performance computing, visualization, and collaboration. In this paper, a highly efficient platform named the High-cold environment joint Observation and Research cloud of China (HeorCloud) is established for geoscience research in the high and cold regions of China, based on cloud computing technologies. HeorCloud implements a unified service system named Gateway, which is used to combine and optimally configure data, model, computing, and visualization resources. Beyond the basic services of sharing data, models, and computing resources, the platform also builds Gateway-based online research communities for professional fields, each containing data, analytical tools, models, and computing resources. So far, the platform has established atmosphere, hydrology, remote-sensing, and permafrost research communities applicable to the high-cold environment of China, and it continues to expand its resources.
Citations: 0
Enabling Trade-offs Between Accuracy and Computational Cost: Adaptive Algorithms to Reduce Time to Clinical Insight
J. Dakka, Kristof Farkas-Pall, Vivek Balasubramanian, M. Turilli, S. Wan, D. Wright, S. Zasada, P. Coveney, S. Jha
The efficacy of drug treatments depends on how tightly small molecules bind to their target proteins. Quantifying the strength of these interactions (the so-called 'binding affinity') is a grand challenge of computational chemistry, surmounting which could revolutionize drug design and provide the platform for patient-specific medicine. Recently, evidence from blind challenge predictions and retrospective validation studies has suggested that molecular dynamics (MD) can now achieve useful predictive accuracy (approximately 1 kcal/mol). This accuracy is sufficient to greatly accelerate hit-to-lead and lead optimization. Translating these advances in predictive accuracy into clinical and/or industrial decision making requires that binding free energy results be turned around on reduced timescales without loss of accuracy. This demands advances in algorithms, scalable software systems, and intelligent and efficient utilization of supercomputing resources. This work is motivated by the real-world problem of providing insight from drug candidate data on as short a time scale as possible. Specifically, we reproduce results from a collaborative project between UCL and GlaxoSmithKline to study a congeneric series of drug candidates binding to the BRD4 protein, inhibitors of which have shown promising preclinical efficacy in pathologies ranging from cancer to inflammation. We demonstrate the use of a framework called HTBAC, designed to support the aforementioned requirements of accurate and rapid binding affinity calculations. HTBAC facilitates the execution of large numbers of simulations while supporting the adaptive execution of algorithms. Furthermore, HTBAC enables the selection of simulation parameters at runtime, which can, in principle, optimize the use of computational resources while producing results within a target uncertainty.
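The kind of runtime adaptivity described, trading compute for accuracy until a target uncertainty is met, can be sketched as follows; run_replica is a hypothetical stand-in for an MD-based free-energy calculation, and the stopping rule is an assumption rather than an HTBAC API.

```python
# Hypothetical sketch of the adaptivity HTBAC enables: keep launching
# replica simulations until the standard error of the mean binding free
# energy falls below a target uncertainty.
import random
import statistics


def estimate_until(run_replica, target_stderr, min_n=4, max_n=64):
    results = [run_replica() for _ in range(min_n)]
    while len(results) < max_n:
        stderr = statistics.stdev(results) / len(results) ** 0.5
        if stderr <= target_stderr:
            break                      # target uncertainty reached: stop
        results.append(run_replica())  # otherwise buy accuracy with compute
    return statistics.mean(results), len(results)


# Demo with a synthetic noisy estimator on the kcal/mol scale:
mean, n = estimate_until(lambda: random.gauss(-7.2, 0.8), target_stderr=0.25)
print(f"binding free energy ~ {mean:.2f} kcal/mol after {n} replicas")
```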
Citations: 6
Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications
Gokcen Kestor, I. Peng, R. Gioiosa, S. Krishnamoorthy
Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both the time and resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the number and scale of the fault injection experiments to be performed and provides a validated approach to studying application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.
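The shape of the methodology can be sketched as follows, assuming scikit-learn, a logistic-regression model, and an invented feature set; the paper's actual features and model family may differ.

```python
# Hypothetical sketch: fit a model on fault-injection outcomes gathered at
# small scale, then query it for a larger configuration. scikit-learn, the
# feature set, and the model family are assumptions for illustration.
from sklearn.linear_model import LogisticRegression

# Each row: [core_count, injection_time_fraction, injected_bit_position],
# with outcomes collected from small-scale runs (here, up to 32 cores).
X_small = [[1, 0.10, 3], [2, 0.55, 12], [4, 0.90, 30],
           [8, 0.35, 7], [16, 0.75, 22], [32, 0.20, 15]]
y_small = [0, 1, 0, 0, 1, 1]   # 1 = silent data corruption observed

model = LogisticRegression().fit(X_small, y_small)

# Query the model for a hypothetical 4,096-core configuration.
print(model.predict_proba([[4096, 0.50, 12]])[0][1])
```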
Citations: 7
The Impact of Task Runtime Estimate Accuracy on Scheduling Workloads of Workflows
A. Ilyushkin, D. Epema
Workflow schedulers often rely on task runtime estimates when making scheduling decisions, and they usually target the scheduling of a single workflow or batches of workflows. In contrast, in this paper, we evaluate the impact of absent or inaccurate task runtime estimates on slowdown when scheduling complete workloads of workflows that arrive over time. We study a total of seven scheduling policies: four are popular existing policies for (batches of) workflows from the literature, including a simple backfilling policy that is not aware of task runtime estimates; two are novel workload-oriented policies, including one that targets fairness; and one is the well-known HEFT policy for a single workflow adapted to the online workload scenario. We simulate homogeneous and heterogeneous distributed systems to evaluate the performance of these policies under varying accuracy of task runtime estimates. Our results show that, at high utilizations, the order in which workflows are processed is more important than knowledge of correct task runtime estimates. At low utilizations, all policies considered show good results, even a policy that does not use task runtime estimates. We also show that our Fair Workflow Prioritization (FWP) policy effectively decreases the variance of workflow slowdown and thus achieves fairness, and that the plan-based scheduling policy derived from HEFT does not show much performance improvement while bringing extra complexity to the scheduling process.
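The estimate-free backfilling idea is easy to sketch: whenever a processor is idle, start any task whose predecessors have all finished, without consulting runtime estimates. The DAG encoding below is an assumption for illustration.

```python
# Hypothetical sketch of estimate-free backfilling over workflow tasks.
import heapq


def backfill_schedule(dag, procs=2):
    """dag: {task: (runtime, set_of_dependencies)} -> {task: finish_time}.
    Assumes the DAG is acyclic and therefore always schedulable."""
    done, finish = set(), {}
    running = []                         # heap of (finish_time, task)
    busy, clock = 0, 0.0
    while len(done) < len(dag):
        started = False
        for t, (rt, deps) in dag.items():
            ready = t not in done and t not in finish and deps <= done
            if ready and busy < procs:   # backfill onto an idle processor
                heapq.heappush(running, (clock + rt, t))
                finish[t] = clock + rt
                busy += 1
                started = True
        if not started or busy == procs:
            clock, t = heapq.heappop(running)  # advance to next completion
            done.add(t)
            busy -= 1
    return finish


dag = {"a": (3, set()), "b": (2, set()), "c": (4, {"a"}), "d": (1, {"a", "b"})}
print(backfill_schedule(dag))            # finish times per task
```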
Citations: 14
Towards Resource and Contract Heterogeneity Aware Rescaling for Cloud-Hosted Applications
Mohan Baruwal Chhetri, Quoc Bao Vo, R. Kowalczyk, S. Nepal
Cloud infrastructure providers offer consumers a wide range of resource and contract options to choose from, yet most elasticity management solutions are incapable of leveraging this to optimize the cost and performance of cloud-hosted applications. To address this problem, in this paper we propose a novel resource scaling approach that exploits both resource and contract heterogeneity to achieve optimal resource allocations and better cost control. We model resource allocation as an Unbounded Knapsack Problem, and resource scaling as a one-step-ahead resource allocation problem. Based on this, we present two scaling strategies, namely delta scale optimization and full scale optimization. Delta scale optimization supports the traditional notion of scaling resources horizontally, i.e., it computes an optimal allocation (or deallocation) of resources to increase (or decrease) the total compute capacity based on the current allocation and the forecast application workload. Full scale optimization, on the other hand, supports the notion of cost-optimal resource rescaling, i.e., the simultaneous allocation and deallocation of resources to meet the forecast workload irrespective of the decision to increase, decrease, or maintain capacity. Both strategies give users greater flexibility in managing trade-offs between cost and performance. We motivate our work using a realistic and non-trivial resource scaling scenario for a cloud-hosted IoT platform and use simple use cases to illustrate the benefit of the proposed approach.
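The Unbounded Knapsack formulation admits a standard dynamic program. The sketch below, with invented instance types and prices, picks how many instances of each type to acquire so that total capacity covers the forecast demand at minimum cost; it illustrates the formulation, not the authors' solver.

```python
# Hypothetical sketch of the Unbounded Knapsack formulation for choosing a
# min-cost instance mix that covers a forecast capacity demand.
def min_cost_allocation(instance_types, demand):
    """instance_types: [(name, capacity_units, hourly_cost)]; demand: int."""
    INF = float("inf")
    best = [0.0] + [INF] * demand      # best[d] = min cost to cover demand d
    choice = [None] * (demand + 1)
    for d in range(1, demand + 1):
        for name, cap, cost in instance_types:
            prev = max(0, d - cap)     # one more instance covers `cap` units
            if best[prev] + cost < best[d]:
                best[d] = best[prev] + cost
                choice[d] = (name, prev)
    plan, d = {}, demand               # walk back to reconstruct the mix
    while d > 0 and choice[d]:
        name, d = choice[d]
        plan[name] = plan.get(name, 0) + 1
    return best[demand], plan


types = [("small", 2, 0.10), ("large", 7, 0.30)]  # hypothetical contracts
print(min_cost_allocation(types, 16))  # cheapest mix covering 16 units
```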
Citations: 4
Nitro: Network-Aware Virtual Machine Image Management in Geo-Distributed Clouds
Jad Darrous, Shadi Ibrahim, Amelie Chi Zhou, Christian Pérez
Recently, most large cloud providers, like Amazon and Microsoft, replicate their Virtual Machine Images (VMIs) across multiple geographically distributed data centers to offer fast service provisioning. Provisioning a service may require transferring a VMI over the wide-area network (WAN) and is therefore dictated by the distribution of VMIs and the network bandwidth between sites. Nevertheless, existing methods to facilitate VMI management (i.e., retrieving VMIs) overlook network heterogeneity in geo-distributed clouds. In this paper, we design, implement, and evaluate Nitro, a novel VMI management system that helps to minimize the transfer time of VMIs over a heterogeneous WAN. To achieve this goal, Nitro incorporates two complementary features. First, it makes use of deduplication to reduce the amount of data to be transferred, exploiting the high similarity within and between images. Second, Nitro is equipped with a network-aware data transfer strategy that effectively exploits high-bandwidth links when acquiring data, thus expediting provisioning. Experimental results show that our network-aware data transfer strategy finds the optimal solution when acquiring VMIs while introducing minimal overhead. Moreover, Nitro outperforms state-of-the-art VMI storage systems (e.g., OpenStack Swift) by up to 77%.
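A network-aware retrieval strategy in this spirit can be approximated with a greedy rule: fetch each deduplicated chunk from the replica site whose link would finish it soonest, given bandwidth and already-assigned load. The sketch below uses invented sites and bandwidths and simplifies the paper's scheduling strategy.

```python
# Hypothetical greedy sketch of network-aware chunk retrieval, not Nitro's
# actual (optimal) algorithm.
def plan_transfers(chunk_sources, bandwidth_mbps, chunk_mb=4):
    """chunk_sources: {chunk_id: [candidate sites]};
    bandwidth_mbps: {site: link bandwidth to the destination, in Mb/s}."""
    assigned_mb = {site: 0.0 for site in bandwidth_mbps}
    plan = {}
    for chunk, sites in chunk_sources.items():
        # Estimated completion time if this chunk is added to a site's link.
        best = min(sites, key=lambda s:
                   (assigned_mb[s] + chunk_mb) * 8 / bandwidth_mbps[s])
        assigned_mb[best] += chunk_mb
        plan[chunk] = best
    return plan


sources = {0: ["paris", "tokyo"], 1: ["tokyo"], 2: ["paris", "virginia"]}
links = {"paris": 1000, "tokyo": 200, "virginia": 400}  # Mb/s, invented
print(plan_transfers(sources, links))  # chunk -> chosen source site
```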
Citations: 14
A Comparative Study of Topology Design Approaches for HPC Interconnects
Md Atiqul Mollah, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
Recent interconnect topology designs for High Performance Computing (HPC) systems have followed two directions, one characterized by low diameter and the other by high path diversity. The low-diameter approach focuses on building large networks with small diameters, guaranteeing one short path between each pair of nodes; examples include Slim Fly and Dragonfly. The high-path-diversity approach takes into account not only topological metrics such as diameter but also the path diversity between pairs of nodes; examples include the fat-tree, the Random Regular Graph (RRG), and the Generalized De Bruijn Graph (GDBG). Topologies designed with these two approaches have distinct features and require very different routing schemes to exploit the network capacity. In this work, we study the performance-related topological features of representative topologies of the two design approaches, including Slim Fly, Dragonfly, RRG, and GDBG, and compare HPC application performance on these topologies under a set of routing schemes. The study uncovers new knowledge about the topologies designed by these two approaches. Findings include (1) that the load-balancing routing technique designed for low-diameter topologies, known as Universal Globally Adaptive Load-balanced routing (UGAL), can be effectively adapted to high-path-diversity topologies, and (2) that high-path-diversity topologies generally achieve higher performance than low-diameter topologies for networks built from a similar number of the same type of switches.
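UGAL's decision rule is compact enough to sketch: route minimally unless the minimal path's queue occupancy, weighted by hop count, exceeds that of a randomly chosen non-minimal (Valiant) path. The (hops, queue) inputs below are illustrative.

```python
# Hypothetical sketch of the UGAL decision rule referenced in the paper.
import random


def ugal_choose(min_path, nonmin_paths):
    """Each path is (hop_count, queue_occupancy); returns the chosen path."""
    candidate = random.choice(nonmin_paths)  # one random non-minimal option
    min_hops, min_q = min_path
    nm_hops, nm_q = candidate
    # Compare delay proxies: hop count times local queue occupancy.
    if min_hops * min_q <= nm_hops * nm_q:
        return min_path
    return candidate


print(ugal_choose((2, 1), [(4, 1), (4, 2)]))   # lightly loaded: minimal
print(ugal_choose((2, 40), [(4, 1), (4, 2)]))  # congested: non-minimal wins
```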
Citations: 7