Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00081
Cleverton Vicentini, A. Santin, E. Viegas, Vilmar Abreu
Cloud computing is intrinsically based on multi-tenancy, which enables a physical host to be shared amongst several tenants (customers). In this context, for several reasons, a cloud provider may overload a physical machine by hosting more tenants than it can adequately handle. In such a case, a tenant may experience application performance issues but is unable to identify the cause, since most cloud providers do not expose performance metrics for customer monitoring, or, when they do, the metrics can be biased. This study proposes a two-tier auditing model for identifying multi-tenancy issues within the tenant domain. Our proposal relies on machine learning techniques fed with application and virtual-resource metrics, gathered within the tenant domain, to identify overloaded resources in a distributed application context. An evaluation using Apache Storm as a case study shows that our proposal identifies a node experiencing multi-tenancy interference of at least 6%, with less than 1% false-positive and false-negative rates, regardless of the affected resource. Moreover, our model generalizes the multi-tenancy interference behavior learned from private cloud testbed monitoring to different hardware configurations. Thus, a system administrator can monitor an application at a public cloud provider without possessing any hardware-level performance metrics.
{"title":"A Machine Learning Auditing Model for Detection of Multi-Tenancy Issues Within Tenant Domain","authors":"Cleverton Vicentini, A. Santin, E. Viegas, Vilmar Abreu","doi":"10.1109/CCGRID.2018.00081","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00081","url":null,"abstract":"Cloud computing is intrinsically based on multi-tenancy, which enables a physical host to be shared amongst several tenants (customers). In this context, for several reasons, a cloud provider may overload the physical machine by hosting more tenants that it can adequately handle. In such a case, a tenant may experience application performance issues. However, the tenant is not able to identify the causes, since most cloud providers do not provide performance metrics for customer monitoring, or when they do, the metrics can be biased. This study proposes a two-tier auditing model for the identification of multi-tenancy issues within the tenant domain. Our proposal relies on machine learning techniques fed with application and virtual resource metrics, gathered within the tenant domain, for identifying overloading resources in a distributed application context. The evaluation using Apache Storm as a case study, has shown that our proposal is able to identify a node experiencing multi-tenancy interference of at least 6%, with less than 1% false-positive or false-negative rates, regardless of the affected resource. Nonetheless, our model was able to generalize the multi-tenancy interference behavior based on private cloud testbed monitoring, for different hardware configurations. Thus, a system administrator can monitor an application in a public cloud provider, without possessing any hardware-level performance metrics.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"286 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124565354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00027
Thomas B. Rolinger, T. Simon, Christopher D. Krieger
Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU micro-benchmarks and conduct a case study of sparse tensor factorization, an application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying numbers of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH, and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs, and Cray's CS-Storm with 16 GPUs. Our results show that the irregularity in the tensor data sets produces trends that contradict those seen in the OSU micro-benchmarks, as well as trends that are absent from the benchmarks.
{"title":"An Empirical Evaluation of Allgatherv on Multi-GPU Systems","authors":"Thomas B. Rolinger, T. Simon, Christopher D. Krieger","doi":"10.1109/CCGRID.2018.00027","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00027","url":null,"abstract":"Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU-micro benchmark as well as conduct a case study for sparse tensor factorization, one application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying number of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in the tensor data sets produce trends that contradict those in the OSU micro-benchmark, as well as trends that are absent from the benchmark.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134572638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00038
A. Segalini, Dino Lopez Pacheco, Quentin Jacquemart
In Data Centers (DCs), an abundance of virtual machines (VMs) remain idle, either because their network services are waiting for incoming connections or because of established-but-idle sessions. These VMs waste RAM – the scarcest resource in DCs – as they lock their allocated memory. In this paper, we introduce SEaMLESS, a solution designed to (i) transform fully-fledged idle VMs into lightweight, resourceless virtual network functions (VNFs) and (ii) reclaim the memory allocated to those idle VMs. By replacing idle VMs with VNFs, SEaMLESS provides fast VM restoration upon user activity detection, thereby limiting the impact on the Quality of Experience (QoE). Our results show that SEaMLESS can consolidate hundreds of VMs as VNFs onto a single machine. SEaMLESS is thus able to release the majority of the memory allocated to idle VMs. This freed memory can then be reassigned to new VMs, or used for massive consolidation, enabling better utilization of DC resources.
{"title":"Towards Massive Consolidation in Data Centers with SEaMLESS","authors":"A. Segalini, Dino Lopez Pacheco, Quentin Jacquemart","doi":"10.1109/CCGRID.2018.00038","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00038","url":null,"abstract":"In Data Centers (DCs), an abundance of virtual machines (VMs) remain idle due to network services awaiting for incoming connections, or due to established-and-idling sessions. These VMs lead to wastage of RAM – the scarcest resource in DCs – as they lock their allocated memory. In this paper, we introduce SEaMLESS, a solution designed to (i) transform fully-fledged idle VMs into lightweight and resourceless virtual network functions (VNFs), then (ii) reduces the allocated memory to those idle VMs. By replacing idle VMs with VNFs, SEaMLESS provides fast VM restoration upon user activity detection, thereby introducing limited impact on the Quality of Experience (QoE). Our results show that SEaMLESS can consolidate hundreds of VMs as VNFs onto one single machine. SEaMLESS is thus able to release the majority of the memory allocated to idle VMs. This freed memory can then be reassigned to new VMs, or lead to massive consolidation, to enable a better utilization of DC resources.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131989413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00025
K. Uehara, Yu Xiang, Y. Chen, M. Hiltunen, Kaustubh R. Joshi, R. Schlichting
The explosive growth of data driven by the increasing adoption of cloud technologies in the enterprise has created a strong demand for more flexible, cost-effective, and scalable storage solutions. Many storage systems, however, are not well matched to the workloads they serve, because it is difficult to configure a storage system optimally a priori with only approximate knowledge of the workload characteristics. This paper shows how cloud-based orchestration can be leveraged to create flexible storage solutions that use continuous adaptation to tailor themselves to their target application workloads and, in doing so, provide superior performance, cost, and scalability over traditional fixed designs. To demonstrate this approach, we have built "SuperCell," a Ceph-based distributed storage solution with a recommendation engine for storage configuration. SuperCell provides storage operators with real-time recommendations on how to reconfigure the storage system to optimize its performance, cost, and efficiency, based on statistical storage modeling and data analysis of the actual workload. Using real cloud storage workloads, we experimentally demonstrate that SuperCell reduces the cost of storage systems by up to 48% while meeting the service-level agreement (SLA) 99% of the time, a level that no static design meets for these workloads.
{"title":"SuperCell: Adaptive Software-Defined Storage for Cloud Storage Workloads","authors":"K. Uehara, Yu Xiang, Y. Chen, M. Hiltunen, Kaustubh R. Joshi, R. Schlichting","doi":"10.1109/CCGRID.2018.00025","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00025","url":null,"abstract":"The explosive growth of data due to the increasing adoption of cloud technologies in the enterprise has created a strong demand for more flexible, cost-effective, and scalable storage solutions. Many storage systems, however, are not well matched to the workloads they service due to the difficulty of configuring the storage system optimally a priori with only approximate knowledge of the workload characteristics. This paper shows how cloud-based orchestration can be leveraged to create flexible storage solutions that use continuous adaptation to tailor themselves to their target application workloads, and in doing so, provide superior performance, cost, and scalability over traditional fixed designs. To demonstrate this approach, we have built \"SuperCell,\" a Ceph-based distributed storage solution with a recommendation engine for the storage configuration. SuperCell provides storage operators with real-time recommendations on how to reconfigure the storage system to optimize its performance, cost, and efficiency based on statistical storage modeling and data analysis of the actual workload. Using real cloud storage workloads, we experimentally demonstrate that SuperCell reduces the cost of storage systems by up to 48%, while meeting service level agreement (SLA) 99% of the time, a level that any static design fails to meet for the workloads.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131582213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00065
Yi Zhou, Shubbhi Taneja, Mohammed I. Alghamdi, X. Qin
The goal of this study is to optimize the energy efficiency of database clusters through prefetching and caching strategies. We design a workload-skewness scheme to collectively manage a set of hot and cold nodes in a database cluster. The prefetching mechanism fetches popular data tables to the hot nodes while keeping unpopular data on cold nodes. We leverage a power management module that aggressively switches cold nodes into low-power mode to conserve energy. We construct a prefetching model and an energy-saving model to govern the power management module in database clusters. The energy-efficient prefetching and caching mechanism reduces the number of power-state transitions, thereby offering high energy efficiency. We systematically evaluate our energy conservation technique in the process of managing, fetching, and storing data on clusters supporting database applications. Our experimental results show that our prefetching/caching solution significantly improves the energy efficiency of an existing PostgreSQL system.
{"title":"Improving Energy Efficiency of Database Clusters Through Prefetching and Caching","authors":"Yi Zhou, Shubbhi Taneja, Mohammed I. Alghamdi, X. Qin","doi":"10.1109/CCGRID.2018.00065","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00065","url":null,"abstract":"The goal of this study is to optimize energy efficiency of database clusters through prefetching and caching strategies. We design a workload-skewness scheme to collectively manage a set of hot and cold nodes in a database cluster system. The prefetching mechanism fetches popular data tables to the hot nodes while keeping unpopular data in cold nodes. We leverage a power management module to aggressively turn cold nodes in the low-power mode to conserve energy consumption. We construct a prefetching model and an energy-saving model to govern the power management module in database lusters. The energy-efficient prefetching and caching mechanism is conducive to cutting back the number of power-state transitions, thereby offering high energy efficiency. We systematically evaluate energy conservation technique in the process of managing, fetching, and storing data on clusters supporting database applications. Our experimental results show that our prefetching/caching solution significantly improves energy efficiency of the existing PostgreSQL system.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114874299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00071
Vito Giovanni Castellana, Marco Minutoli
The unprecedented amount of data that needs to be processed in emerging data analytics applications poses novel challenges to industry and academia. Scalability and high performance become more than a desirable feature because, due to the scale and nature of the problems, they draw the line between what is achievable and what is infeasible. In this paper, we propose SHAD, the Scalable High-performance Algorithms and Data-structures library. SHAD adopts a modular design that confines low-level details and promotes reuse. SHAD's core is built on an Abstract Runtime Interface, which enhances portability and identifies the minimal set of features of the underlying system required by the framework. The core library includes common data structures such as Array, Vector, Map, and Set. These are designed to accommodate significant amounts of data that can be accessed in massively parallel environments, and to serve as building blocks for SHAD extensions, i.e., higher-level software libraries. We have validated and evaluated our design with a performance and scalability study of the core components of the library. We have validated the design's flexibility by proposing a Graph Library as an example of a SHAD extension, which implements two different graph data structures; we evaluate their performance with a set of graph applications. Experimental results show that the approach is promising in terms of both performance and scalability. On a distributed system with 320 cores, SHAD Arrays are able to sustain a throughput of 65 billion operations per second, while SHAD Maps sustain 1 billion operations per second. Algorithms implemented using the Graph Library exhibit performance and scalability comparable to a custom solution, but with a smaller development effort.
{"title":"SHAD: The Scalable High-Performance Algorithms and Data-Structures Library","authors":"Vito Giovanni Castellana, Marco Minutoli","doi":"10.1109/CCGRID.2018.00071","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00071","url":null,"abstract":"The unprecedented amount of data that needs to be processed in emerging data analytics applications poses novel challenges to industry and academia. Scalability and high performance become more than a desirable feature because, due to the scale and the nature of the problems, they draw the line between what is achievable and what is unfeasible. In this paper, we propose SHAD, the Scalable High-performance Algorithms and Data-structures library. SHAD adopts a modular design that confines low level details and promotes reuse. SHAD's core is built on an Abstract Runtime Interface which enhances portability and identifies the minimal set of features of the underlying system required by the framework. The core library includes common data-structures such as: Array, Vector, Map and Set. These are designed to accommodate significant amount of data which can be accessed in massively parallel environments, and used as building blocks for SHAD extensions, i.e. higher level software libraries. We have validated and evaluated our design with a performance and scalability study of the core components of the library. We have validated the design flexibility by proposing a Graph Library as an example of SHAD extension, which implements two different graph data-structures; we evaluate their performance with a set of graph applications. Experimental results show that the approach is promising in terms of both performance and scalability. On a distributed system with 320 cores, SHAD Arrays are able to sustain a throughput of 65 billion operations per second, while SHAD Maps sustain 1 billion of operations per second. Algorithms implemented using the Graph Library exhibit performance and scalability comparable to a custom solution, but with smaller development effort.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124036661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00056
Dung Nguyen, André Luckow, Edward B. Duffy, Ken E. Kennedy, A. Apon
This paper presents a systematic evaluation of Amazon Kinesis and Apache Kafka for meeting highly demanding application requirements. Results show that Kinesis and Kafka can provide high reliability, performance and scalability. Cost and performance trade-offs of Kinesis and Kafka are presented for a variety of application data rates, resource utilization, and resource configurations.
{"title":"Evaluation of Highly Available Cloud Streaming Systems for Performance and Price","authors":"Dung Nguyen, André Luckow, Edward B. Duffy, Ken E. Kennedy, A. Apon","doi":"10.1109/CCGRID.2018.00056","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00056","url":null,"abstract":"This paper presents a systematic evaluation of Amazon Kinesis and Apache Kafka for meeting highly demanding application requirements. Results show that Kinesis and Kafka can provide high reliability, performance and scalability. Cost and performance trade-offs of Kinesis and Kafka are presented for a variety of application data rates, resource utilization, and resource configurations.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121050171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00043
Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu
Deep learning is currently the most promising approach to developing computer systems with human-like intelligence. To speed up the development of neural networks, researchers have designed many distributed learning algorithms to facilitate the training process. In these algorithms, a constant is typically used as the communication period for model/gradient exchange. We find that this type of communication pattern can incur unnecessary and inefficient data transmission for some training methods, e.g., elastic SGD and gossiping SGD. In this paper, we propose an adaptive communication method to improve the performance of gossiping SGD. Instead of using a fixed period for model exchange, we exchange models with other machines according to how much the local model has changed. This makes the communication more efficient and thus improves performance. Experimental results show that our method reduces communication traffic by 92%, which results in a 52% reduction in training time while preserving prediction accuracy, compared with gossiping SGD.
{"title":"Adaptive Communication for Distributed Deep Learning on Commodity GPU Cluster","authors":"Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu","doi":"10.1109/CCGRID.2018.00043","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00043","url":null,"abstract":"Deep learning is now the most promising approach to develop human-intelligent computer systems. To speedup the development of neural networks, researchers have designed many distributed learning algorithms to facilitate the training process. In these algorithms, people use a constant to indicate the communication period for model/gradient exchange. We find that this type of communication pattern could incur unnecessary and inefficient data transmission for some training methods e.g., elastic SGD and gossiping SGD. In this paper, we propose an adaptive communication method to improve the performance of gossiping SGD. Instead of using a fixed period for model exchange, we exchange the models with other machines according to the change of the local model. This makes the communication more efficient and thus improves the performance. The experiment results show that our method reduces the communication traffic by 92%, which results in 52% reduction in training time while preserving the prediction accuracy compared with gossiping SGD.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121177898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00028
Aleksandra Kuzmanovska, H. V. D. Bogert, R. H. Mak, D. Epema
When multiple data-processing frameworks with time-varying workloads are simultaneously present in a single cluster or data center, an apparent goal is to have them experience equal performance, expressed in whatever performance metrics are applicable. In modern data-center environments, the resources of such frameworks are typically managed by Two-Level Schedulers (TLSs), which leave the scheduling of individual jobs to the schedulers within the frameworks themselves. Two such TLSs with opposite designs are Mesos and Koala-F. Mesos employs fine-grained resource allocation and aims at Dominant Resource Fairness (DRF) among framework instances by offering resources to them for the duration of a single task. In contrast, Koala-F aims at performance fairness among framework instances by employing dynamic coarse-grained allocation of sets of complete nodes based on performance feedback from individual instances. The goal of this paper is to explore the trade-offs between these two TLS designs when trying to achieve performance balance among frameworks. We select Apache Spark as a representative data-processing framework and perform experiments on a modest-sized cluster, using jobs chosen from commonly used data-processing benchmarks. Our results reveal that achieving performance balance among framework instances is a challenge for both TLS designs, despite their opposite design choices. Moreover, we expose design flaws in the DRF allocation policy that prevent Mesos from achieving performance balance. Finally, to remedy these flaws, we propose a feedback controller for Mesos that dynamically adapts the framework weights used in Weighted DRF (W-DRF) based on their performance.
{"title":"Achieving Performance Balance Among Spark Frameworks with Two-Level Schedulers","authors":"Aleksandra Kuzmanovska, H. V. D. Bogert, R. H. Mak, D. Epema","doi":"10.1109/CCGRID.2018.00028","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00028","url":null,"abstract":"When multiple data-processing frameworks with time-varying workloads are simultaneously present in a single cluster or data-center, an apparent goal is to have them experience equal performance, expressed in whatever performance metrics are applicable. In modern data-center environments, Two-Level Schedulers (TLSs) that leave the scheduling of individual jobs to the schedulers within the data-processing frameworks are typically used for managing the resources of data-processing frameworks. Two such TLSs with opposite designs are Mesos and Koala-F. Mesos employs fine-grained resource allocation and aims at Dominant Resource Fairness (DRF) among framework instances by offering resources to them for the duration of a single task. In contrast, Koala-F aims at performance fairness among framework instances by employing dynamic coarse-grained resource allocation of sets of complete nodes based on performance feedback from individual instances. The goal of this paper is to explore the trade-offs between these two TLS designs when trying to achieve performance balance among frameworks. We select Apache Spark as a representative of data-processing frameworks, and perform experiments on a modest-sized cluster, using jobs chosen from commonly used data-processing benchmarks. Our results reveal that achieving performance balance among framework instances is a challenge for both TLS designs, despite their opposite design choices. Moreover, we exhibit design flaws in the DRF allocation policy that prevent Mesos from achieving performance balance. Finally, to remedy these flaws, we propose a feedback controller for Mesos that dynamically adapts framework weights, as used in Weighted DRF (W-DRF), based on their performance.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116725946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00007
G. Laccetti, M. Lapegna, R. Montella
Concurrent data structures are widely used at many levels of the software stack, ranging from high-level parallel scientific applications to low-level operating systems. The key issue with these objects is their concurrent use by several computing units (threads or processes), which makes their design much more difficult than that of their sequential counterparts: their extremely dynamic nature requires protocols to ensure data consistency, with a significant cost overhead. In this regard, several studies emphasize a tension between the need for sequential correctness of concurrent data structures and the scalability of the algorithms, and in many cases the data structure design must be rethought, using approaches based on randomization and/or redistribution techniques, in order to fully exploit the computational power of recent computing environments. The problem has grown in importance with the new generation of High Performance Computing systems aimed at extreme performance. Such systems are based on heterogeneous architectures integrating several independent nodes in the form of clusters or MPP systems, where each node is composed of powerful computing elements (CPU cores, GPUs, or other acceleration devices) sharing resources within a single node. These systems therefore make massive use of communication libraries to exchange data among the nodes, as well as other tools for the management of the shared resources inside a single node. For this reason, developing algorithms and scientific software for dynamic data structures on these heterogeneous systems requires a suitable combination of several methodologies and tools to deal with the different kinds of parallelism corresponding to each specific device, so as to be aware of the underlying platform. The present work introduces a scalable model to manage a special class of dynamic data structure known as the heap-based priority queue (or simply heap) on these heterogeneous architectures. A heap is generally used when an application needs a set of data that does not require a complete ordering, but only access to items tagged with high priority. To ensure correct access to high-priority items by the several computing units while keeping communication and synchronization overhead low, a suitable reorganization of the heap is needed. More precisely, we introduce a unified scalable model that can be used, without modification, to redeploy the items of a heap both in message-passing environments (such as clusters or MPP multicomputers with several nodes) and in shared-memory environments (such as CPUs and multiprocessors with several cores), with an overhead independent of the number of computing units. Computational results from applying the proposed strategy to some numerical case studies are presented for different types of computing environments.
{"title":"A Scalable Unified Model for Dynamic Data Structures in Message Passing (Clusters) and Shared Memory (multicore CPUs) Computing environments","authors":"G. Laccetti, M. Lapegna, R. Montella","doi":"10.1109/CCGRID.2018.00007","DOIUrl":"https://doi.org/10.1109/CCGRID.2018.00007","url":null,"abstract":"Concurrent data structures are widely used in many software stack levels, ranging from high level parallel scientific applications to low level operating systems. The key issue of these objects is their concurrent use by several computing units (threads or process) so that the design of these structures is much more difficult compared to their sequential counterpart, because of their extremely dynamic nature requiring protocols to ensure data consistency, with a significant cost overhead. At this regard, several studies emphasize a tension between the needs of sequential correctness of the concurrent data structures and scalability of the algorithms, and in many cases it is evident the need to rethink the data structure design, using approaches based on randomization and/or redistribution techniques in order to fully exploit the computational power of the recent computing environments. The problem is grown in importance with the new generation High Performance Computing systems aimed to achieve extreme performance. It is easy to observe that such systems are based on heterogeneous architectures integrating several independent nodes in the form of clusters or MPP systems, where each node is composed by powerful computing elements (CPU core, GPUs or other acceleration devices) sharing resources in a single node. These systems therefore make massive use of communication libraries to exchange data among the nodes, as well as other tools for the management of the shared resources inside a single node. For such a reason, the development of algorithms and scientific software for dynamic data structures on these heterogeneous systems implies a suitable combination of several methodologies and tools to deal with the different kinds of parallelism corresponding to each specific device, so that to be aware of the underlying platform. The present work is aimed to introduce a scalable model to manage a special class of dynamic data structure known as heap based priority queue (or simply heap) on these heterogeneous architectures. A heap is generally used when the applications needs set of data not requiring a complete ordering, but only the access to some items tagged with high priority. In order to ensure a tradeoff between the correct access to high priority items by the several computing units with a low communication and synchronization overhead, a suitable reorganization of the heap is needed. More precisely we introduce a unified scalable model that can be used, with no modifications, to redeploy the items of a heap both in message passing environments (such as clusters and or MMP multicomputers with several nodes) as well as in shared memory environments (such as CPUs and multiprocessors with several cores) with an overhead independent of the number of computing units. 
Computational results related to the application of the proposed strategy on some numerical case studies are presented for different types of computing environments.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"378 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126972784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
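The unified model redeploys the items of a heap across computing units so that each unit keeps accessing high-priority items with low overhead; the model itself is not detailed in the abstract. The sketch below is a hedged, purely sequential illustration of the redistribution idea, with Python's heapq standing in for per-unit heaps; the sort-then-deal policy is an invented example, not the paper's model.

```python
# Hedged sketch: partition a priority queue's items across several computing
# units so that each local heap still holds some of the highest-priority items.
# Sorting + round-robin dealing is one simple redistribution, shown only to
# illustrate the idea.
import heapq

def redistribute(items, num_units):
    """Deal (priority, payload) items to per-unit heaps in priority order."""
    local_heaps = [[] for _ in range(num_units)]
    for i, item in enumerate(sorted(items)):
        heapq.heappush(local_heaps[i % num_units], item)
    return local_heaps

items = [(p, f"task-{p}") for p in (42, 3, 17, 8, 99, 1, 56, 23)]
heaps = redistribute(items, num_units=4)

# Each unit can now pop a high-priority item without touching the others;
# periodic redistribution keeps the local minima close to the global minimum.
for unit, h in enumerate(heaps):
    print(f"unit {unit}: next item {heapq.heappop(h)}")
```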