
Latest Publications in IEEE Cloud Computing

Towards complete dis-aggregation of data center rack power using light-weight mechanisms
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00052
Kalyan Dasgupta, Umamaheswari Devi, Aanchal Goyal
Enterprises worldwide are increasingly prioritizing sustainability due to the growing focus on carbon neutrality as well as the need to comply with emerging strict regulations from governments across the globe. With many enterprise workloads deployed on clouds and data centers, it is becoming inevitable for cloud providers and data center operators to quantify each client’s share of the total carbon emission from their facility in order to fulfill the mandatory carbon reporting requirements of their clients. Accurate carbon quantification requires power measurements to be available at the lowest level of the hardware infrastructure, such as physical servers and network switches. However, power sensing is quite limited in many data centers, with measurements normally available only at an aggregated level such as the rack level. To drill down to the level of a workload and capture the correct power usage per workload, it is very important to dis-aggregate this power across servers. In this paper, we propose a software-based non-linear model that uses the Newton-Raphson method to estimate the power model parameters of individual servers from server utilizations when only the overall rack-level power measurements are given. The methodology is applicable to data centers with multiple types of servers in a rack and is light-weight in the sense that it does not require mechanisms such as shutting down individual servers in order to estimate idle power. The method is also generalized to account for the real-world scenario where the time granularities of rack power and server utilization measurements may not match. We have conducted detailed evaluations of the proposed methods and find good convergence for parameter estimation even when tested with multiple different initial conditions.
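The abstract does not give the paper's exact power model, so the sketch below only illustrates the idea: assume each server i draws a_i + b_i * u^c_i watts at utilization u, observe only the rack-level total, and recover the per-server parameters with a Newton-type least-squares solver. The model form, parameter values, and variable names are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of rack-power dis-aggregation under an assumed per-server model.
# Only the rack total is observed at each timestep; per-server (a, b, c) are estimated.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
n_servers, n_samples = 3, 200

# Hypothetical ground truth: idle power a, dynamic range b, exponent c per server.
true = np.array([[60.0, 120.0, 1.3], [45.0, 90.0, 1.1], [80.0, 150.0, 1.5]])
util = rng.uniform(0.0, 1.0, size=(n_samples, n_servers))            # per-server utilization
rack_power = sum(true[i, 0] + true[i, 1] * util[:, i] ** true[i, 2] for i in range(n_servers))
rack_power += rng.normal(0.0, 2.0, n_samples)                         # measurement noise

def residuals(theta):
    """Rack-level prediction error; theta packs (a_i, b_i, c_i) for every server."""
    a, b, c = theta.reshape(n_servers, 3).T
    pred = (a + b * util ** c).sum(axis=1)
    return pred - rack_power

theta0 = np.tile([50.0, 100.0, 1.0], n_servers)                       # common initial guess
fit = least_squares(residuals, theta0, bounds=(0.0, np.inf))          # Newton-type solver
print(fit.x.reshape(n_servers, 3))                                    # estimated per-server parameters
```

With enough samples and varied utilizations, the per-server parameters are identifiable from rack totals alone, which is the "no server shutdown needed" point the abstract makes.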
Citations: 0
Xonar: Profiling-based Job Orderer for Distributed Deep Learning
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00030
Changyong Shin, Gyeongsik Yang, Yeonho Yoo, J. Lee, C. Yoo
Deep learning models span a wide spectrum of GPU execution times and memory sizes. When running distributed training jobs, however, their GPU execution time and memory size have not been taken into account, which leads to high variance in job completion time (JCT). Moreover, the jobs often run into the GPU out-of-memory (OoM) problem, so the unlucky job has to restart from scratch. To address these problems, we propose Xonar to profile deep learning jobs and order them in the queue. The experiments show that Xonar with TensorFlow v1.6 reduces the tail JCT by 44% with the OoM problem eliminated.
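The abstract does not spell out Xonar's ordering policy; the sketch below shows one plausible profiling-based policy for illustration only: defer jobs whose profiled memory would overflow the free GPU memory (avoiding OoM restarts) and run shorter profiled jobs first (reducing tail JCT). All names and numbers are hypothetical.

```python
# Hypothetical profiling-based job ordering, not Xonar's actual algorithm.
from dataclasses import dataclass

@dataclass
class JobProfile:
    name: str
    gpu_mem_gb: float        # profiled peak GPU memory
    est_time_min: float      # profiled estimated training time

def order_jobs(queue: list[JobProfile], free_gpu_mem_gb: float) -> list[JobProfile]:
    runnable = [j for j in queue if j.gpu_mem_gb <= free_gpu_mem_gb]
    deferred = [j for j in queue if j.gpu_mem_gb > free_gpu_mem_gb]   # would hit OoM now
    runnable.sort(key=lambda j: j.est_time_min)                        # shortest job first
    return runnable + deferred

queue = [JobProfile("resnet50", 10.2, 95), JobProfile("bert-large", 31.0, 240), JobProfile("lstm", 4.1, 30)]
print([j.name for j in order_jobs(queue, free_gpu_mem_gb=16.0)])       # ['lstm', 'resnet50', 'bert-large']
```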
Citations: 3
Cloud Data Center Fabric Virtualization
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00048
Ali Sydney, A. Alim, Chris Ward, C. Basso, B. Karaçali
Cloud networks support workloads with diverse characteristics. A key challenge facing cloud providers is how to meet the stringent performance and security needs of these diverse applications running over a shared infrastructure. To address this challenge, some providers build for peak capacity or even build dedicated clusters with specialized networks, which may be under-utilized at times. We propose a virtualization approach that customizes data center network resources to the needs of applications. Our approach is based on slicing data center network resources on-demand and customizing these slices to target workloads. Such slices can grow or shrink dynamically and programmatically based on workload demands. This elasticity provides a more efficient solution over building dedicated clusters. In our approach, a slice can be customized to a given set of workloads with similar security and performance requirements carved out of the underlying network. It leverages a software-defined underlay network controller and segment routing for fine-grained path control and service chaining. We have implemented a prototype of our fabric virtualization solution based on network slicing. In this paper, we first present the architecture of our prototype. Second, we present empirical results of slice provisioning times in networks of varying sizes and switch operating systems. Empirical results indicate that our prototype can support slice provisioning in the order of tens to hundreds of seconds and can meet the provisioning requirements of production networks.
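The paper's controller API is not described in the abstract; purely as an illustration of what an elastic, workload-customized slice might look like, the hypothetical sketch below models a slice as a set of workloads with a bandwidth budget and a segment-routing path that can be resized programmatically.

```python
# Hypothetical network-slice specification; names and fields are assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class FabricSlice:
    name: str
    workloads: list[str]
    bandwidth_gbps: float
    sr_segments: list[str] = field(default_factory=list)   # segment-routing path through the fabric

    def resize(self, new_bandwidth_gbps: float) -> None:
        """Grow or shrink the slice elastically based on workload demand."""
        self.bandwidth_gbps = new_bandwidth_gbps

analytics = FabricSlice("analytics", ["spark-01", "spark-02"], 40.0, ["leaf1", "spine3", "leaf7"])
analytics.resize(80.0)   # scale up for a peak instead of building a dedicated cluster
```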
Citations: 0
A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00068
Danlin Jia, Geng Yuan, Xue Lin, N. Mi
Deep Neural Networks (DNNs) have been applied as effective machine learning algorithms to tackle problems in different domains. However, training a sophisticated DNN model takes days to weeks, which poses a challenge for research on large-scale DNN models. Distributed Deep Learning (DDL) helps accelerate DNN training by distributing training workloads across multiple computation accelerators (e.g., GPUs). Although a surge of research work has been devoted to optimizing DDL training, the impact of data-loading on GPU usage and training performance has been relatively under-explored. It is non-trivial to optimize data-loading in DDL applications that need intensive CPU and I/O resources to process enormous training data. When multiple DDL applications are deployed on a system (e.g., cloud and HPC), the lack of a practical and efficient technique for data-loader allocation incurs GPU idleness and degrades the training throughput. Therefore, our work first focuses on investigating the impact of data-loading on the global training throughput. We then propose a throughput prediction model to predict the maximum throughput for an individual DDL training application. By leveraging the predicted results, A-Dloader is designed to dynamically allocate CPU and I/O resources to concurrently running DDL applications and use the data-loader allocation as a knob to reduce GPU idle intervals and thus improve the overall training throughput. We implement and evaluate A-Dloader in a DDL framework for a series of DDL applications arriving and completing across the runtime. Our experimental results show that A-Dloader achieves a 23.5% throughput improvement and a 10% makespan improvement, compared to allocating resources evenly across applications.
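A-Dloader's actual allocation algorithm is not given in the abstract; the hypothetical sketch below only illustrates the "data-loader as a knob" idea: hand out a fixed pool of CPU loader workers largest-deficit-first, so the most data-hungry job's GPUs idle the least, instead of splitting workers evenly. The job names, rates, and worker model are assumptions.

```python
# Hypothetical data-loader worker allocation across concurrent DDL jobs (illustration only).
# Each job needs `required samples/s` to keep its GPUs busy; each CPU worker delivers `worker_rate`.
def allocate_workers(jobs: dict[str, float], total_workers: int, worker_rate: float) -> dict[str, int]:
    alloc = {name: 1 for name in jobs}                 # every job gets at least one worker
    remaining = total_workers - len(jobs)
    for _ in range(remaining):
        # Give the next worker to the job whose GPUs are starved the most.
        deficit = {n: jobs[n] - alloc[n] * worker_rate for n in jobs}
        neediest = max(deficit, key=deficit.get)
        alloc[neediest] += 1
    return alloc

jobs = {"resnet": 900.0, "bert": 300.0, "gan": 600.0}   # required samples/s per job (hypothetical)
print(allocate_workers(jobs, total_workers=12, worker_rate=100.0))
# {'resnet': 7, 'bert': 1, 'gan': 4} -- more loaders go to the most data-hungry jobs
```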
Citations: 1
Q-percentile Bandwidth Billing Based Geo-Scheduling Algorithm
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00042
Yaoyin You, Binbin Feng, Zhijun Ding
Current IaaS providers deploy cheaper computing resources in newly built data centers and provide cross-regional network services to improve the interoperability of computing resources in different regions. Third-party service providers can use part of their budget to purchase cross-regional communication resources, so as to use cheaper resources in remote areas and reduce the cost of processing massive task requests. The Q-percentile charging model is widely used for billing cross-regional communication resources, but there is little task scheduling research on that billing method. Therefore, this paper studies a geo-distributed task scheduling scenario under the Q-percentile charging model. We design a geo-scheduling algorithm specifically for the Q-percentile charging model to allocate resources along the two dimensions of computing resources and communication resources. Furthermore, referring to three existing communication resource allocation strategies, we design three bandwidth allocation algorithms that take the Q-percentile charging characteristics into account, to provide suitable solutions for different scenarios. We conducted experiments based on well-known public datasets such as the LIGO workflow. Results show that, compared with the baseline, the scheduling algorithm proposed in this paper can reduce the task scheduling cost between geo-distributed data centers by 10%-20% under various task loads, and they reveal differences in the applicability of the different communication resource allocation strategies.
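For readers unfamiliar with Q-percentile (burstable) billing: usage is sampled periodically (e.g. every 5 minutes), the top (100 - Q)% of samples are discarded, and the highest remaining sample sets the billable rate, so short bursts are effectively free. The worked example below uses hypothetical sample values and a hypothetical price; exact percentile definitions vary slightly across providers.

```python
# Q-percentile (burstable) bandwidth billing on sampled usage.
import math

def q_percentile_rate(samples_mbps: list[float], q: float = 95.0) -> float:
    """Bill the q-th percentile sample; the top (100-q)% of samples are free bursts."""
    ordered = sorted(samples_mbps)
    idx = math.ceil(q / 100.0 * len(ordered)) - 1      # index of the Q-th percentile sample
    return ordered[idx]

# 20 five-minute samples (hypothetical): steady traffic plus one short burst to 950 Mbps.
samples = [100] * 10 + [150] * 5 + [180, 190, 200, 210, 950]
rate = q_percentile_rate(samples, q=95.0)
print(rate)                     # 210 -- the single 950 Mbps burst falls in the discarded top 5%
print(rate * 4.0)               # monthly charge at a hypothetical $4 per billable Mbps
```

This is why scheduling matters under such billing: moving traffic into the already-discarded burst windows costs nothing, while raising the Q-th percentile sample raises the whole bill.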
Citations: 0
Building Golden Signal Based Signatures for Log Anomaly Detection
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00040
Seema Nagar, Suranjana Samanta, P. Mohapatra, Debanjana Kar
As an increasing number of organizations migrate to the cloud, the main challenge before an operations team is how to effectively use the overwhelming amount of information derivable from multiple data sources, such as logs, metrics, and traces, to help maintain the robustness and availability of cloud services. Site Reliability Engineers (SREs) depend on periodic log data to understand the state of an application and to diagnose the potential root cause of a problem. Despite best practices, service outages happen and result in the loss of billions of dollars in revenue. Many times, indicators of these outages are buried in the flood of alerts that an SRE receives. Therefore, it is important to reduce noisy alerts so that an SRE can focus on what is critical. Log anomaly detection detects anomalous system behaviours and finds patterns (anomalies) in data that do not conform to expected behaviour. Different anomaly detection techniques have been incorporated into various AIOps platforms, but they all suffer from a large number of false positives. Also, some anomalies are transient and resolve on their own. In this paper, we propose an unsupervised, model-agnostic persistent anomaly detector based on golden-signal-based signatures, as a post-processing filtering step on detected anomalies, so that we do not have to interfere with the anomaly detector already deployed in a system.
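The signature construction itself is not detailed in the abstract; the sketch below only illustrates the post-processing idea: surface a raw log anomaly only when it persists across several consecutive windows and a golden signal (latency, traffic, errors, saturation) is degraded in the same windows. The window encoding and threshold are assumptions.

```python
# Hypothetical post-filter that keeps only persistent, golden-signal-backed anomalies.
def filter_persistent(anomaly_windows: list[bool], golden_degraded: list[bool], min_windows: int = 3) -> list[bool]:
    keep = [False] * len(anomaly_windows)
    run = 0
    for t, (anom, degraded) in enumerate(zip(anomaly_windows, golden_degraded)):
        run = run + 1 if (anom and degraded) else 0
        if run >= min_windows:
            keep[t] = True          # alert only after persistence with golden-signal evidence
    return keep

raw      = [True, True, False, True, True, True, True, False]   # raw detector output per window
degraded = [True, True, False, True, True, True, False, False]  # e.g. p99 latency above SLO
print(filter_persistent(raw, degraded))   # transient blips are suppressed; only the sustained run alerts
```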
Citations: 3
Automated Configuration for Agile Software Environments
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00074
Negar Mohammadi Koushki, Sanjeev Sondur, K. Kant
The increasing use of the DevOps paradigm in software systems has substantially increased the frequency of configuration parameter setting changes. Ensuring the correctness of such settings is generally a very challenging problem due to the complex interdependencies, and calls for an automated mechanism that can both run quickly and provide accurate settings. In this paper, we propose an efficient discrete combinatorial optimization technique that makes two unique contributions: (a) an improved and extended metaheuristic that exploits the application domain knowledge for fast convergence, and (b) the development and quantification of a discrete version of the classical tunneling mechanism to improve the accuracy of the solution. Our extensive evaluation using available workload traces that do include configuration information shows that the proposed technique can provide a lower-cost solution (by ~60%) with faster convergence (by ~48%) as compared to the traditional metaheuristic algorithms. Also, our solution succeeds in finding a feasible solution in approximately 30% more cases than the baseline algorithm.
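The abstract does not define the paper's discrete tunneling mechanism, so the sketch below is a generic illustration of the concept only: ordinary local-search moves change one configuration parameter at a time, and when the search stalls, a "tunneling" move perturbs several parameters at once to hop through the cost barrier around a local optimum. The toy cost function and domains are hypothetical.

```python
# Generic discrete local search with a tunneling jump (illustration, not the paper's algorithm).
import random

def local_search_with_tunneling(cost, domains, iters=2000, stall_limit=50, seed=1):
    rng = random.Random(seed)
    best = [rng.choice(d) for d in domains]            # random initial configuration
    best_cost, stall = cost(best), 0
    for _ in range(iters):
        y = list(best)
        if stall >= stall_limit:                       # tunneling: jump in several dimensions at once
            for i in rng.sample(range(len(domains)), k=max(2, len(domains) // 2)):
                y[i] = rng.choice(domains[i])
            stall = 0
        else:                                          # ordinary move: tweak a single parameter
            i = rng.randrange(len(domains))
            y[i] = rng.choice(domains[i])
        c = cost(y)
        if c < best_cost:
            best, best_cost, stall = y, c, 0
        else:
            stall += 1
    return best, best_cost

# Toy configuration problem (hypothetical): pick buffer size, thread count, and batch size.
domains = [[64, 128, 256, 512], [1, 2, 4, 8, 16], [16, 32, 64, 128]]
cost = lambda x: abs(x[0] - 256) + 10 * abs(x[1] - 4) + abs(x[2] - 64)
print(local_search_with_tunneling(cost, domains))      # expected to converge to (256, 4, 64)
```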
Citations: 2
Distributed online extraction of a fluid model for microservice applications using local tracing data
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00037
Johan Ruuskanen, A. Cervin
Dynamic resource management is a difficult problem in modern microservice applications. Many proposed methods rely on the availability of an analytical performance model, often based on queueing theory. Such models can always be hand-crafted, but this takes time and requires expert knowledge. Various methods have been proposed that can automatically extract models from logs or tracing data. However, they are often intricate, requiring off-line stages and advanced algorithms for retrieving the service-time distributions. Furthermore, the resulting models can be complex and unsuitable for online evaluation. Aiming for simplicity, in this paper we introduce a general queuing network model for microservice applications that can be (i) quickly and accurately solved using a refined mean-field fluid model and (ii) completely extracted at runtime in a distributed fashion from common local tracing data at each service. The fit of the model and the prediction accuracy under system perturbations are evaluated in a cloud-based microservice application, and the model is found to be accurate.
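The paper's refined mean-field model is not reproduced here; as background, the minimal classic fluid approximation of a single service tier is sketched below: the expected number of requests x(t) evolves as dx/dt = lam - mu * min(x, s) for arrival rate lam, per-server service rate mu, and s concurrent servers, integrated with forward-Euler steps. Parameter values are illustrative.

```python
# Minimal classic fluid approximation of one microservice tier (not the paper's refined model).
def fluid_trajectory(lam: float, mu: float, s: int, x0: float = 0.0, dt: float = 0.01, t_end: float = 20.0):
    xs, x = [], x0
    for _ in range(int(t_end / dt)):
        x += dt * (lam - mu * min(x, s))      # forward-Euler integration of the fluid ODE
        xs.append(x)
    return xs

traj = fluid_trajectory(lam=8.0, mu=1.0, s=10)
print(round(traj[-1], 2))   # settles near lam/mu = 8 requests in service (utilization 0.8)
```

The appeal of fluid models, as the abstract argues, is exactly this: they are cheap enough to re-solve online as tracing data updates the parameters.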
Citations: 2
GenoPPML – a framework for genomic privacy-preserving machine learning
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00076
Sergiu Carpov, Nicolas Gama, Mariya Georgieva, Dimitar Jetchev
We present GenoPPML, a framework for privacy-preserving machine learning in the context of sensitive genomic data processing. The technology combines secure multiparty computation techniques, based on the recently proposed Manticore framework, for model training with fully homomorphic encryption, based on TFHE, for model inference. The framework was successfully used to solve breast cancer prediction problems on gene expression datasets coming from distinct private sources while preserving their privacy - the solution won 1st place in both Tracks I and III of the genomic privacy competition iDASH’2020. Extensive benchmarks and comparisons to existing works are performed. Our 2-party logistic regression computation is 11× faster than the one in [1] on the same dataset, and it uses only one CPU core.
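The GenoPPML/Manticore/TFHE pipeline itself is not reproduced here. As background on homomorphic inference in general, FHE schemes evaluate additions and multiplications, so non-polynomial activations such as the logistic sigmoid are commonly replaced by low-degree polynomial approximations. The plaintext sketch below illustrates that idea only; the polynomial fit, weights, and bias are hypothetical and computed on the spot, not taken from the paper.

```python
# Plaintext illustration of an FHE-friendly logistic-regression score via a polynomial sigmoid.
import numpy as np

z = np.linspace(-6, 6, 400)
sigmoid = 1.0 / (1.0 + np.exp(-z))
coeffs = np.polyfit(z, sigmoid, deg=3)              # least-squares polynomial approximation

def poly_sigmoid(t):
    return np.polyval(coeffs, t)                    # only + and * : evaluable under FHE

w = np.array([0.8, -0.5, 1.2])                      # hypothetical trained logistic-regression weights
x = np.array([0.3, 1.1, -0.4])                      # one (would-be encrypted) gene-expression sample
score = poly_sigmoid(w @ x + 0.1)                   # bias 0.1, also hypothetical
print(float(score), float(1.0 / (1.0 + np.exp(-(w @ x + 0.1)))))   # polynomial vs exact sigmoid
```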
Citations: 9
Cop-Flash: Utilizing hybrid storage to construct a large, efficient, and durable computational storage for DNN training
Q1 Computer Science Pub Date: 2022-07-01 DOI: 10.1109/CLOUD55607.2022.00041
Chunhua Xiao, S. Qiu, Dandan Xu
Traditional computing architectures that separate computing from storage face severe limitations when processing the data that is continuously produced in the cloud and at the edge. Recently, the computational storage device (CSD) has become one of the critical cloud infrastructures that can overcome these limitations. Many studies utilize CSDs for DNN training to extract useful information and knowledge from the data quickly and efficiently. However, all previous work has used homogeneous storage, which does not fully account for the requirements of DNN training on CSDs. Thus, we leverage hybrid NAND flash memory to address this problem. Nevertheless, typical hybrid storage architectures have limitations when used for DNN training. Moreover, their management strategies cannot fully exploit the heterogeneity of hybrid flash memory. To address this issue, we propose a novel SLC-TLC flash memory called Co-Partitioning Flash (Cop-Flash), which utilizes two different hybrid flash memory partitioning methods to divide storage into three regions of flash memory with different properties. Meanwhile, two key technologies are included in Cop-Flash: 1) a lifetime-based I/O identifier is proposed to identify data hotness according to data lifetime, maximizing the benefits of heterogeneity and minimizing the impact of garbage collection; 2) Erase-aware Adaptive Dual-zone Management is proposed to increase bandwidth utilization and guarantee system reliability. We compared Cop-Flash with two related state-of-the-art hybrid storage designs using hard partitioning and soft partitioning, as well as with TLC-only flash memory, under real DNN training workloads. Experimental results show that Cop-Flash improves performance by 29.1%, 38.8%, and 56.6%, respectively, and outperforms them by 2.3x, 1.29x, and 8.3x in terms of lifespan.
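The abstract names a lifetime-based I/O identifier but does not give its details; the sketch below only illustrates the general idea: the lifetime of a logical block is the time between a write and its invalidation by the next overwrite, and short-lived (hot) data is steered to the fast, durable SLC region while long-lived (cold) data goes to TLC. The threshold and bookkeeping structure are assumptions.

```python
# Hypothetical lifetime-based hotness classifier for SLC/TLC placement (illustration only).
class LifetimeClassifier:
    def __init__(self, hot_lifetime_s: float = 60.0):
        self.hot_lifetime_s = hot_lifetime_s
        self.last_write: dict[int, float] = {}       # logical block address -> last write timestamp

    def on_write(self, lba: int, now_s: float) -> str:
        """Return the target region ('SLC' or 'TLC') for this write."""
        region = "SLC"                                # default for never-seen (unknown-lifetime) data
        if lba in self.last_write:
            lifetime = now_s - self.last_write[lba]
            region = "SLC" if lifetime < self.hot_lifetime_s else "TLC"
        self.last_write[lba] = now_s
        return region

clf = LifetimeClassifier()
print(clf.on_write(42, now_s=0.0))     # SLC (first write, lifetime unknown)
print(clf.on_write(42, now_s=10.0))    # SLC (overwritten after 10 s -> hot, short-lived data)
print(clf.on_write(42, now_s=500.0))   # TLC (overwritten after 490 s -> cold, long-lived data)
```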
Citations: 0